PREFACE



Three and a half years have passed since I started this research on March 1, 1987. During this period, 
so many things were designed and accomplished, and as the principal investigator I find it extremely 
difficult to include and systematise all the important findings and implications within a single final 
report. It is my regret that many of them have to be left out, but I did my best within a limited 
amount of time with the hope that this final report will help the reader to grasp the outline of the 
whole accomplishment. 

There were five main objectives in the original research proposal, and they can be summarised as 
follows. 

[1] Further investigate the nonparametric approach to the estimation of the operating char- 
acteristics of discrete item responses. 

[2] Revise and strengthen the package computer programs and eventually implement them in 
the Unix Operating System. 

[3] Investigate an ideal computerised adaptive testing procedure and eventually materialize it 
in the SUN microcomputer system networked with IBM personal computers. 

[4] Investigate multidimensional latent trait theory. 

[5] Pursue item validity and test validity using the multidimensional latent space. 

Out of these objectives, Objectives [1] and [5], together with Objectives [2] and [3], were most intensively
pursued. The highest productivity belongs to this part of the research, which provides us with valuable 
future perspectives of research. 

During the research period there were many people who helped me as assistants, secretaries, etc., 
as I acknowledged in each research report. Also people of the Office of Naval Research, especially Dr. 
Charles E. Davis, and those of the ONR Atlanta Office, including Mr. Thomas Bryant, have been of 
great help in conducting the research. I would like to express my gratitude to all of them. 

Thanks are also due to my assistants, Nancy H. Domm and Raed A. Hijer, who helped me in
preparing this final report. Appreciation is also extended to my former assistants, Christine A. Golik 
and Philip S. Livingston, who still helped me occasionally during the research period. 

September 20, 1990 
Author 
I Introduction



This is the final report of the multi-year research project entitled Validity Study in Multidimensional
Latent Space and Efficient Computerized Adaptive Testing, which was sponsored by the Office of Naval 
Research in 1987 through 1990 (N00014-87-K-0320). The accomplishments include those which have 
already been published as ONR research reports as well as those still in progress, which will be published 
in later years as part of more comprehensive research results. 

The rest of this chapter will describe papers published or presented during the research period, and 
related events. The contents of the research accomplishments will be summarized and systematized, 
and will be described in the succeeding chapters. 

[I.1] Research Reports

The following are the ONR research reports that have been published in the present research project. 

(1) Modifications of the Test Information Function. Office of Naval Research Report 90-1, 
1990. 

(2) Predictions of Reliability Coefficients and Standard Errors of Measurement Using the Test 
Information Function and its Modifications. Office of Naval Research Report 90-2, 1990. 

(3) Validity Measures in the Context of Latent Trait Models. Office of Naval Research Report
90-3, 1990.

(4) Differential Weight Procedure of the Conditional P.D.F. Approach for Estimating the 
Operating Characteristics of Discrete Item Responses. Office of Naval Research Report 
90-4, 1990. 

(5) Content-Based Observation of Informative Distractors and Efficiency of Ability Estimation. 
Office of Naval Research Report 90-5, 1990. 

[I.2] Special Contribution Paper

During this period, at the request of Dr. Chikio Hayashi, president of the Behaviormetric Society,
a special contribution paper entitled Comprehensive Latent Trait Theory was written and published in
Behaviormetrika, Vol. 24, 1988. The paper is based upon the invited address, a one-hour special lecture
overviewing latent trait models, which was given at the 1987 Annual Meeting of the Behaviormetric
Society at Kyushu University, Fukuoka, Japan, under the title, Overview of Latent Trait Models.
There were more than two hundred researchers in the audience, and the summary of the paper is given
as Appendix B of the author's ONR Final Report: Advancement of Latent Trait Theory, which was
published in 1988.

[I.3] Paper Presentations at Conferences

Thirteen papers were presented at conferences during this research period, excluding those in
1987 which have been reported in "Final Report: Advancement of Latent Trait Theory." They include
ONR contractors' meetings, and are listed below.

(1) A Robust Method of On-Line Calibration, American Educational Research Association 
Meeting, New Orleans, 1988. U. S. A. 

(2) Some Modifications of the On-Line Item Calibration Methods. ONR Conference on Model- 
Based Measurement, Iowa City, 1988. U. S. A. 



(3) Information Functions of the General Model Developed for Differential Strategies and
Possibilities for Applying Half-Discrete, Half-Continuous Models for Projective Techniques.
ONR Conference on Model-Based Measurement, Iowa City, 1988. U. S. A.

(4) Some Refinement in the Estimation of the Operating Characteristics of Discrete Item Re-
sponses without Assuming any Mathematical Form. Psychometric Society Meeting, Los
Angeles, 1988. U. S. A.

(5) Prospect of Analysing Rorschach Data by Sophisticated Psychometric Methods. Sympo-
sium: The Burstein-Loucks Rorschach Scoring System: Clinical and Psychometric De-
velopments. American Psychological Association Annual Meeting, Atlanta, 1988. U. S. A.

(6) Latent Trait Approach to Rorschach Diagnosis Based upon the Burstein-Loucks Scoring
System. American Educational Research Association Annual Meeting, San Francisco, 1989.
U. S. A. (round-table session)

(7) Some Considerations on Validity Measures in Latent Trait Theory. ONR Conference on 
Model-Based Measurement, Norman, OK, 1989. U. S. A. 

(8) Differential Weight Procedure of the Conditional P.D.F. Approach in the Estimation of 
Operating Characteristics of Discrete Item Responses. ONR Conference on Model-Based 
Measurement, Norman, OK, 1989. U. S. A. 

(9) Some Reliability and Validity Measures in the Context of Latent Trait Models. Psychome-
tric Society Annual Meeting, Los Angeles, 1989. U. S. A.

(10) Prospect of Applying Latent Trait Models and Methodologies Accommodating Both Psycholog-
ical and Neurological Factors. American Educational Research Association Annual Meet-
ing, Boston, 1990. U. S. A.

(11) Reliability/Validity Indices in the Context of Latent Trait Models. American Educational 
Research Association Annual Meeting, Boston, 1990. U. S. A. 

(12) Further Considerations for the Differential Weight Procedure of Estimating the Operating 
Characteristics of Discrete Item Responses. ONR Conference on Model-Based Measure- 
ment, Portland, OR, 1990. U. S. A. 

(13) Modified Test Information Functions, Their Usefulnesses and Prediction of the Test Reli- 
ability Coefficient Tailored for a Specific Ability Distribution. ONR Conference on Model- 
Based Measurement, Portland, OR, 1990. U. S. A. 

[I.4] Other Events

The principal investigator gave a seminar entitled Comprehensive Latent Trait Models in September,
1989, at the National Center for University Entrance Examination, Tokyo, Japan, invited by Dr. Shuichi
Iwatsubo of the Center and Dr. Kazuo Shigematsu of the Tokyo Engineering University.

There were also collaborations with Professor Sukeyori Shiba of the University of Tokyo, and
with Dr. Takahiro Sato of the C & C Information Technology Research Laboratories of Nippon Electric
Company, Japan.



II Backgrounds and Basic Concepts Used throughout the Research



In this chapter, the backgrounds and the basic concepts upon which the present research has been
conducted are introduced. The reader is directed to the author's two previous ONR final reports
(Samejima, 1981b, 1988) and other ONR research reports for a more detailed account of these concepts and
developments.

[II.1] General Concepts in Latent Trait Models

Let $\theta$ be ability, or latent trait, which assumes any real number. Let $g$ $(= 1, 2, \ldots, n)$ denote an
item, $k_g$ be any discrete item response to item $g$, and $P_{k_g}(\theta)$ denote the operating characteristic of
$k_g$, or the conditional probability assigned to $k_g$, given $\theta$, i.e.,

(2.1) $P_{k_g}(\theta) = \mathrm{prob.}[k_g \mid \theta]$ .

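We assume that $P_{k_g}(\theta)$ is three-times differentiable with respect to $\theta$. We have for the item response
information function (Samejima, 1972)

(2.2) $I_{k_g}(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_{k_g}(\theta) = \left[\frac{\partial}{\partial\theta}P_{k_g}(\theta)\,\{P_{k_g}(\theta)\}^{-1}\right]^2 - \frac{\partial^2}{\partial\theta^2}P_{k_g}(\theta)\,\{P_{k_g}(\theta)\}^{-1}$ ,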

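and the item information function is defined as the conditional expectation of $I_{k_g}(\theta)$, given $\theta$, such
that

(2.3) $I_g(\theta) = E[I_{k_g}(\theta) \mid \theta] = \sum_{k_g} I_{k_g}(\theta)\,P_{k_g}(\theta) = \sum_{k_g}\left[\frac{\partial}{\partial\theta}P_{k_g}(\theta)\right]^2 \{P_{k_g}(\theta)\}^{-1}$ .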



In the special case where the item $g$ is scored dichotomously, this item information function is simplified
to become

(2.4) $I_g(\theta) = \left[\frac{\partial}{\partial\theta}P_g(\theta)\right]^2 \left[\{P_g(\theta)\}\{1 - P_g(\theta)\}\right]^{-1}$ ,

where $P_g(\theta)$ denotes the operating characteristic of the correct answer to item $g$.
Let $V$ be a response pattern such that

(2.5) $V = \{k_g\}'$ , $g = 1, 2, \ldots, n$ .

The operating characteristic, $P_V(\theta)$, of the response pattern $V$ is defined as the conditional probability
of $V$, given $\theta$. Throughout this report the principle of local independence is assumed to be valid,
so that within any group of examinees all characterized by the same value of the latent variable $\theta$
the distributions of the item response categories are all independent of each other. Thus the operating
characteristic of a given response pattern is a product of the operating characteristics of the item
response categories contained in that response pattern, so that we can write

(2.6) $P_V(\theta) = \prod_{k_g \in V} P_{k_g}(\theta)$ .

The response pattern information function, $I_V(\theta)$, (Samejima, 1972) is given by

(2.7) $I_V(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_V(\theta) = \sum_{k_g \in V} I_{k_g}(\theta)$ ,

and the test information function, $I(\theta)$, is defined as the conditional expectation of $I_V(\theta)$, given $\theta$,
and we obtain from (2.2), (2.3), (2.5), (2.6) and (2.7)

(2.8) $I(\theta) = E[I_V(\theta) \mid \theta] = \sum_{g=1}^{n} I_g(\theta)$ .
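As a minimal sketch of formulae (2.4), (2.6) and (2.8), the following Python fragment computes item and test information for dichotomous items and the operating characteristic of a response pattern under local independence. The logistic item model, the parameter values and all function names are assumptions made for illustration only; they are not taken from the report.

```python
import math

def logistic_p(theta, a, b, scale=1.7):
    """Probability of a correct answer under a logistic item (an assumed model)."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))

def item_information(p_func, theta, h=1e-5):
    """Dichotomous item information, formula (2.4), with a central-difference derivative."""
    p = p_func(theta)
    dp = (p_func(theta + h) - p_func(theta - h)) / (2.0 * h)
    return dp * dp / (p * (1.0 - p))

def total_information(p_funcs, theta):
    """Test information, formula (2.8): the sum of the item information functions."""
    return sum(item_information(f, theta) for f in p_funcs)

def pattern_probability(p_funcs, pattern, theta):
    """Operating characteristic of a 0/1 response pattern under local independence, formula (2.6)."""
    prob = 1.0
    for f, u in zip(p_funcs, pattern):
        p = f(theta)
        prob *= p if u == 1 else 1.0 - p
    return prob

# Hypothetical (discrimination, difficulty) pairs, chosen only for illustration.
PARAMS = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
ITEMS = [lambda t, a=a, b=b: logistic_p(t, a, b) for a, b in PARAMS]

if __name__ == "__main__":
    print(total_information(ITEMS, 0.0))
    print(pattern_probability(ITEMS, (1, 1, 0), 0.0))
```

Because the response patterns exhaust all possibilities, their operating characteristics sum to one at any fixed value of the latent trait, which gives a quick sanity check on the product formula.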

[II.2] Critical Observations of the Reliability, Standard Error of Measurement and Validity of a Test

The reliability coefficient and the standard error of measurement in classical mental test theory are two
concepts that have widely been accepted and used by psychologists and test users in the past decades.
The author has pointed out repeatedly, however, that these measures are actually the attributes of a
specified group of examinees as well as of a given test. In addition, even if we take this fact into account,
representation of these measures by single numbers results in over-simplification and the lack of useful
information for both theorists and actual users of tests. In contrast to this, in latent trait models,
the item and test information functions, which are defined by (2.3) and (2.8), respectively, provide us
with abundant information about the local accuracy of estimation, a concept which is totally missing
in classical mental test theory. These functions are population-free, i.e., they do not depend upon any
specific group of examinees as the reliability coefficient and the standard error of measurement do.

Unlike the progressive dissolution of test reliability, test validity is one concept that has rather
been neglected in the context of latent trait models. Several types of validity have been identified and
discussed in classical mental test theory, which include content validity, construct validity, and criterion-
oriented validity. Perhaps we can say that, in modern mental test theory, both content validity and
construct validity are well accommodated, although they are not explicitly stated. If each item is based
upon cognitive processes that are directly related to the ability to be measured, then the content of
the operationally defined latent variable behind the examinees' performances will be validated. Also
construct validity can be identified, with all the mathematically sophisticated structures and functions
which characterize latent trait models and which classical mental test theory does not provide. With
respect to the criterion-oriented validity, however, so far latent trait models have not offered so much
as they did to the test reliability and to the standard error of measurement.

In classical mental test theory, the validity coefficient is again a single number, i.e., the product-
moment correlation coefficient between the test score and the criterion variable. Since the correlation
coefficient is largely affected by the heterogeneity of the group of examinees, i.e., for a fixed test the
coefficient tends to be higher when individual differences among the examinees in the group are greater
and vice versa (cf. Samejima, 1977b), we must keep in mind that so-called test validity represents the
degree of heterogeneity in ability among the examinees tested, as well as the quality of the test itself.
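The dependence of a single-number index upon the group tested can be illustrated with the classical definition of the reliability coefficient as the ratio of true-score variance to observed-score variance. The sketch below assumes that the error variance of the test is the same in both groups and is uncorrelated with the true score; the numerical values and names are hypothetical.

```python
def classical_reliability(true_score_variance, error_variance):
    """Classical reliability coefficient: true-score variance over observed-score variance."""
    return true_score_variance / (true_score_variance + error_variance)

# Hypothetical values: one test (fixed error variance) given to two groups
# that differ only in the spread of ability.
ERROR_VARIANCE = 0.25
homogeneous_group = classical_reliability(0.5, ERROR_VARIANCE)
heterogeneous_group = classical_reliability(2.0, ERROR_VARIANCE)

if __name__ == "__main__":
    print(round(homogeneous_group, 3), round(heterogeneous_group, 3))
```

The same test yields a larger coefficient in the more heterogeneous group, although its accuracy of measurement has not changed.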

[II.3] Nonparametric Approach to the Estimation of the Operating Characteristics of Discrete Item Responses

As early as 1977 the author proposed the Normal Approximation Method (Samejima, 1977b), which
can be used for item calibration both in computerized adaptive testing and in paper-and-pencil testing.
She also discussed the effective use of information functions in adaptive testing (Samejima, 1977a).
Since then, with the support of the Office of Naval Research, she has developed several approaches and
methods for the same purpose (cf. Samejima, 1977c, 1978a, 1978b, 1978c, 1978d, 1978e, 1978f, 1980a,
1980b, 1981a, 1981b, 1988; Samejima and Changas, 1981). For convenience, they can be categorized as
follows.

Approaches                              Methods

(1) Bivariate P.D.F. Approach           (1) Pearson System Method
(2) Histogram Ratio Approach            (2) Two-Parameter Beta Method
(3) Curve Fitting Approach              (3) Normal Approach Method
(4) Conditional P.D.F. Approach         (4) Lognormal Approach Method
    (4.1) Simple Sum Procedure
    (4.2) Weighted Sum Procedure
    (4.3) Proportioned Sum Procedure

Here by an approach we mean a general procedure in approaching the operating characteristics of a 
discrete item response, and by a method we mean a specific method in approximating the conditional 
density of ability, given its maximum likelihood estimate. Thus a combination of an approach and a 
method provides us with a specific procedure for estimating the operating characteristic of a discrete 
item response. 

These approaches and methods are characterised by two features, i.e., 

(1) estimation is made without assuming any mathematical forms for the operating 
characteristics of discrete item responses, and 

(2) estimation is efficient enough to base itself upon a relatively small set of data of, say, 
several hundred to a few thousand examinees. 

The backgrounds common to the Bivariate and Conditional Approaches and the differences among
different methods can be described as follows. For the sake of simplicity in handling mathematics, the
tentative transformation of $\theta$ to $\tau$ is made by

(2.9) $\tau = C_1^{-1}\int_{-\infty}^{\theta} [I(t)]^{1/2}\,dt + C_0$ ,

where $C_0$ is an arbitrary constant for adjusting the origin of $\tau$, and $C_1$ is an arbitrary constant
which equals the square root of the test information function, $I^*(\tau)$, of $\tau$, so that we can write

(2.10) $C_1 = [I^*(\tau)]^{1/2}$

for all $\tau$. This transformation will be simplified if we use a polynomial approximation to the square
root of the test information function, $[I(\theta)]^{1/2}$, in the least squares sense, which is accomplished by
using the method of moments (cf. Samejima and Livingston, 1979) for the meaningful interval of $\tau$.
Thus (2.9) can be changed to the form



(2.11) $\tau = C_1^{-1}\sum_{k=0}^{m} \alpha_k\,(k+1)^{-1}\,\theta^{k+1} + C_0 = \sum_{k=0}^{m+1} \alpha_k^*\,\theta^k$ ,

where $\alpha_k$ $(k = 0, 1, \ldots, m)$ is the $k$-th coefficient of the polynomial of degree $m$ approximating the
square root of $I(\theta)$, and $\alpha_k^*$ is the new $k$-th coefficient which is given by
(2.12) $\alpha_k^* = \begin{cases} C_0 & k = 0 \\ (C_1 k)^{-1}\,\alpha_{k-1} & k = 1, 2, \ldots, m+1 \end{cases}$ .
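The passage from the polynomial for the square root of the test information function to the polynomial form of the transformation, formulae (2.11) and (2.12), amounts to a term-by-term integration. The following sketch shows this step only; the coefficient values, the constants and the function names are hypothetical.

```python
def tau_coefficients(alpha, c1, c0=0.0):
    """Coefficients alpha*_k of (2.12) from the coefficients alpha_k of the
    polynomial approximating the square root of the test information function."""
    star = [c0]
    for k in range(1, len(alpha) + 1):
        star.append(alpha[k - 1] / (c1 * k))
    return star

def tau(theta, alpha_star):
    """Evaluate the transformed latent trait as the polynomial on the right of (2.11)."""
    return sum(c * theta ** k for k, c in enumerate(alpha_star))

# Hypothetical degree-2 polynomial 2.0 - 0.1 * theta**2 for the square root of I(theta).
ALPHA = [2.0, 0.0, -0.1]
ALPHA_STAR = tau_coefficients(ALPHA, c1=2.0)

if __name__ == "__main__":
    print(ALPHA_STAR)
    print(tau(0.5, ALPHA_STAR))
```

Differentiating the resulting polynomial with respect to the original latent trait and multiplying by the constant of (2.10) should recover the fitted square-root polynomial, which is a convenient check on the coefficients.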



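With this transformation of $\theta$ to $\tau$ and by virtue of (2.10), we can use the asymptotic normality
with the two parameters, $\tau$ and $C_1^{-1}$, as the approximation to the conditional distribution of the
maximum likelihood estimator $\hat\tau$, given its true value $\tau$ (cf. Samejima, 1981b). Then the first through
fourth conditional moments of $\tau$, given $\hat\tau$, can be obtained from the density function, $g^*(\hat\tau)$, of $\hat\tau$
and from the constant $C_1$ by the following four formulae (cf. Samejima, 1981b):

(2.13) $E(\tau \mid \hat\tau) = \hat\tau + C_1^{-2}\,\frac{d}{d\hat\tau}\log g^*(\hat\tau)$ ,

(2.14) $\mathrm{Var.}(\tau \mid \hat\tau) = C_1^{-2}\left[1 + C_1^{-2}\,\frac{d^2}{d\hat\tau^2}\log g^*(\hat\tau)\right]$ ,

(2.15) $E[\{\tau - E(\tau \mid \hat\tau)\}^3 \mid \hat\tau] = C_1^{-6}\left[\frac{d^3}{d\hat\tau^3}\log g^*(\hat\tau)\right]$ ,

(2.16) $E[\{\tau - E(\tau \mid \hat\tau)\}^4 \mid \hat\tau] = C_1^{-4}\left[3 + 6\,C_1^{-2}\left\{\frac{d^2}{d\hat\tau^2}\log g^*(\hat\tau)\right\} + 3\,C_1^{-4}\left\{\frac{d^2}{d\hat\tau^2}\log g^*(\hat\tau)\right\}^2 + C_1^{-4}\left\{\frac{d^4}{d\hat\tau^4}\log g^*(\hat\tau)\right\}\right]$ .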

This density function, $g^*(\hat\tau)$, can be estimated by fitting a polynomial, using the method of moments
(cf. Samejima and Livingston, 1979), as we did in the transformation of $\theta$ to $\tau$, based upon the
empirical set of $\hat\tau$'s. Note that in the above formulae the first moment is about the origin, while the
other three are about the mean.
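A minimal sketch of formulae (2.13) through (2.16) is given below. It takes the derivatives of the logarithm of the density of the maximum likelihood estimate as inputs rather than estimating them, and it is checked on a hypothetical case in which that density is standard normal; the function name and the numerical values are assumptions for illustration.

```python
def conditional_moments(tau_hat, c1, d1, d2, d3, d4):
    """Moments of tau given tau_hat from formulae (2.13) through (2.16).

    d1..d4 are the first through fourth derivatives of log g*(tau_hat),
    evaluated at tau_hat; c1 is the constant C_1 of (2.10).
    """
    v = c1 ** -2  # error variance of tau_hat on the tau scale
    mean = tau_hat + v * d1
    variance = v * (1.0 + v * d2)
    third = v ** 3 * d3
    fourth = v ** 2 * (3.0 + 6.0 * v * d2 + 3.0 * (v * d2) ** 2 + v ** 2 * d4)
    return mean, variance, third, fourth

# Hypothetical check case: g* is a normal density with mean 0 and variance 1,
# so log g* has derivatives -tau_hat, -1, 0 and 0.
TAU_HAT = 1.0
MOMENTS = conditional_moments(TAU_HAT, c1=2.0, d1=-TAU_HAT, d2=-1.0, d3=0.0, d4=0.0)

if __name__ == "__main__":
    print(MOMENTS)
```

In this normal case the conditional distribution is itself normal, so the third moment about the mean vanishes and the fourth equals three times the squared variance, which the formulae reproduce.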

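The two coefficients, $\beta_1$ and $\beta_2$, and Pearson's criterion $\kappa$ are obtained by

(2.17) $\beta_1 = \mu_3^2\,\mu_2^{-3}$ ,

(2.18) $\beta_2 = \mu_4\,\mu_2^{-2}$

and

(2.19) $\kappa = \beta_1(\beta_2 + 3)^2\,[4(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)]^{-1}$ ,

by substituting $\mu_2$, $\mu_3$ and $\mu_4$ by $\mathrm{Var.}(\tau \mid \hat\tau)$, $E[\{\tau - E(\tau \mid \hat\tau)\}^3 \mid \hat\tau]$ and $E[\{\tau - E(\tau \mid \hat\tau)\}^4 \mid \hat\tau]$,
respectively, which are obtained by formulae (2.14), (2.15) and (2.16).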
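The computation of the two coefficients and Pearson's criterion from the second, third and fourth moments about the mean can be sketched as follows, using the standard definitions of these quantities. The handling of a vanishing denominator by returning NaN is a choice made for this sketch, and the moment values are hypothetical.

```python
import math

def pearson_criterion(mu2, mu3, mu4):
    """Return (beta1, beta2, kappa) from the 2nd, 3rd and 4th moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    denominator = 4.0 * (2.0 * beta2 - 3.0 * beta1 - 6.0) * (4.0 * beta2 - 3.0 * beta1)
    # The criterion is undefined when the denominator vanishes (for example,
    # for exactly normal moments), so NaN is returned in that case.
    kappa = beta1 * (beta2 + 3.0) ** 2 / denominator if denominator != 0.0 else math.nan
    return beta1, beta2, kappa

# Hypothetical moments of a mildly skewed conditional distribution.
SKEWED = pearson_criterion(1.0, 0.5, 3.2)
# Moments of the uniform distribution on (0, 1), a symmetric case.
UNIFORM = pearson_criterion(1.0 / 12.0, 0.0, 1.0 / 80.0)

if __name__ == "__main__":
    print(SKEWED)
    print(UNIFORM)
```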

In the Bivariate P.D.F. Approach, we approximate the bivariate distribution of the transformed
latent trait $\tau$ and its maximum likelihood estimate $\hat\tau$ for each subpopulation of examinees who share
the same discrete item response to a specified item. Thus the procedure must be repeated as many times
as the number of discrete item response categories for each separate item. It is rather a time-consuming
approach, and the CPU time for the item calibration increases almost proportionally to the number of
new items.

In contrast to this, the Conditional P.D.F. Approach deals with the total population of subjects, and
all the items together. Effort is focused upon the approximation of the conditional distribution of $\tau$,
given $\hat\tau$, for the total population of examinees, and then the result is branched into separate discrete
item response subpopulations for each item.

If we compare the two approaches with each other, therefore, we can say that the Bivariate P.D.F.
Approach is an orthodox approach, while the Conditional P.D.F. Approach needs an assumption that the
conditional distribution of $\tau$, given $\hat\tau$, is unaffected by the different subpopulations of examinees.
While this assumption can only be tolerated in most cases, the latter approach has two big advantages in
the sense that the CPU time required in item calibration is substantially less, and that it does not have
to deal with subgroups of small numbers of subjects in approximating the joint bivariate distributions
of $\tau$ and $\hat\tau$.

In each of these two approaches, we can choose one of the four methods listed earlier in estimating
the bivariate density of $\tau$ and $\hat\tau$, or the conditional density of $\tau$, given its maximum likelihood
estimate $\hat\tau$. In so doing, in the Pearson System Method, we use all four conditional moments of
$\tau$, given $\hat\tau$, which are estimated through the formulae (2.13) through (2.16), and, using Pearson's
criterion $\kappa$, which is given by (2.19), one of the Pearson System density functions is selected. In the
Two-Parameter Beta Method two of the four parameters of the Beta density function, i.e., the lower
and upper endpoints of the interval of $\tau$ for which the Beta density is positive, are a priori given, and
the other two parameters are estimated by using the first two conditional moments of $\tau$, given $\hat\tau$,
which are provided by (2.13) and (2.14), respectively. In the Normal Approach Method, again we use
only the first two conditional moments of $\tau$, given $\hat\tau$, as the first and second parameters of the normal
density function.
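The moment-matching step of the Two-Parameter Beta Method can be sketched as follows: with the two endpoints fixed in advance, the two shape parameters are solved from a given mean and variance. This is a generic method-of-moments fit for a Beta density on a known interval, written under that assumption; the function name and the numerical values are hypothetical.

```python
def fit_two_parameter_beta(mean, variance, lower, upper):
    """Shape parameters (p, q) of a Beta density on (lower, upper) matching a mean and variance."""
    width = upper - lower
    m = (mean - lower) / width        # mean rescaled to (0, 1)
    v = variance / width ** 2         # variance rescaled accordingly
    common = m * (1.0 - m) / v - 1.0  # equals p + q for a Beta(p, q) density
    if not 0.0 < m < 1.0 or common <= 0.0:
        raise ValueError("moments are incompatible with a Beta density on this interval")
    return m * common, (1.0 - m) * common

# Hypothetical conditional mean and variance, with endpoints set a priori to -4 and 4.
P_SHAPE, Q_SHAPE = fit_two_parameter_beta(mean=-0.8, variance=2.56, lower=-4.0, upper=4.0)

if __name__ == "__main__":
    print(P_SHAPE, Q_SHAPE)
```

Unequal shape parameters give a non-symmetric density, which is the flexibility that the Normal Approach Method lacks.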

If we compare these three methods, it will be appropriate to say that both the Two-Parameter Beta
Method and the Normal Approach Method are simpler versions of the Pearson System Method. And yet the
latter two methods have an advantage of using only the first two estimated conditional moments of
$\tau$, given $\hat\tau$, whereas the former requires the additional third and fourth conditional moments, whose
estimations are less accurate compared with those of the first two conditional moments. If we compare
the Two-Parameter Beta Method with the Normal Approach Method, we will notice that the former
allows non-symmetric density functions, while the latter does not. This is an advantage of the Two-
Parameter Beta Method over the Normal Approach Method, and yet the former has the disadvantage
of the requirement that two of the four parameters should a priori be set.

The Lognormal Approach Method was developed later, which uses up to the third conditional moment
and allows more flexibilities in the shape of the conditional distribution of $\tau$, given $\hat\tau$, than the Normal
Approach Method. It was intended that a happy medium between the Pearson System Method and the
Normal Approach Method would be realized, in the effort of ameliorating the disadvantages of these
two methods and of keeping their separate advantages.

[II.4] Possible Non-Monotonicities of the Operating Characteristics 

As early as 1968 the author wrote about and discussed the conceivable non-monotonicity of the
operating characteristic of the correct answer of the multiple-choice test item, which is based strictly
upon theory (cf. Samejima, 1968). Since then, such a phenomenon has actually been observed with
empirical data. For example, Lord and Novick reported such a curve when they plotted the percent of
the correct answer against the test score for each item as an approximation to the item characteristic
function (cf. Lord and Novick, 1968, Chapter 16). Since, as their Theorem 16.4.1 states, the average,
over all items, of the sample item-test regression, falls along a straight line through the origin with
forty-five degree slope, such a dip cannot be detected for an easy item even if it exists, as far as we use
the item-test regression as an approximation. It is quite possible, therefore, that there is more than
one item among those items that has such a dip; only they were not detected.

In the past years various sets of data based upon the Vocabulary Subtest of the Iowa Tests of Basic
Skills, upon Shiba's Word/Phrase Comprehension Tests, ASVAB Tests of Word Knowledge and of Math
Knowledge, etc., have been analyzed by using, mainly, the Simple Sum Procedure of the Conditional
P.D.F. Approach combined with the Normal Approach Method (cf. Samejima, 1981b). These tests
consist of multiple-choice test items, with four or five alternative answers in each item. As the result,
we have discovered non-monotonic operating characteristics of the correct answer for some of the items
as well as differential information coming from the estimated operating characteristics of the incorrect
alternative answers, which are called plausibility functions.

Such discoveries of non-monotonic operating characteristics can best be accomplished by using a
nonparametric approach to the estimation of the operating characteristics. After the operating charac-
teristics have been discovered by using the nonparametric approach, however, it may be wise to search
for mathematical models that fit the results, and to estimate item parameters accordingly, so that we
shall be able to take advantage of the mathematical simplicity coming from the parameterization.
III Proposal of Two Modification Formulae of the Test Information Function

Although the reciprocal of the test information function $I(\theta)$ provides us with a minimum variance
bound for any unbiased estimator of $\theta$ (cf. Kendall and Stuart, 1961), since the maximum likelihood
estimate, which is denoted by $\hat\theta_V$, is only asymptotically unbiased, for a finite number of items we
need to examine if the bias of $\hat\theta_V$ of a given test over the meaningful range of $\theta$ is practically nil,
before we consider this reciprocal as a minimum variance bound. It has been shown (Samejima, 1977a,
1977b) that in many cases the conditional distribution of $\hat\theta_V$, given $\theta$, converges to $N(\theta, [I(\theta)]^{-1/2})$
relatively quickly. On the other hand, we have also noticed that the speed of convergence is not the same
even if the amount of test information is kept equal. This has been demonstrated by using the Constant
Information Model (Samejima, 1979a), which is represented by

(3.1) $P_g(\theta) = \sin^2[a_g(\theta - b_g) + (\pi/4)]$ ,

where, as before, $P_g(\theta)$ denotes the operating characteristic of the correct answer, and $a_g$ $(> 0)$ and
$b_g$ are the item discrimination and difficulty parameters, respectively. This model provides us with a
constant amount of item information $I_g(\theta)$ which equals $4a_g^2$ for the interval of $\theta$,

(3.2) $-\pi[4a_g]^{-1} + b_g < \theta < \pi[4a_g]^{-1} + b_g$

(cf. Samejima, 1979b).
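That the model (3.1) yields the same amount of item information at every point of the interval (3.2) can be verified numerically by applying the dichotomous item information formula (2.4). The sketch below does so with an analytic first derivative; the parameter values and function names are hypothetical.

```python
import math

def constant_information_p(theta, a, b):
    """Operating characteristic (3.1); meaningful only on the interval (3.2)."""
    return math.sin(a * (theta - b) + math.pi / 4.0) ** 2

def item_information(theta, a, b):
    """Formula (2.4) applied to (3.1), using the analytic first derivative."""
    u = a * (theta - b) + math.pi / 4.0
    p = math.sin(u) ** 2
    dp = 2.0 * a * math.sin(u) * math.cos(u)
    return dp * dp / (p * (1.0 - p))

A, B = 0.5, 0.0  # hypothetical item parameters
HALF_WIDTH = math.pi / (4.0 * A)  # interval (3.2) is B - HALF_WIDTH < theta < B + HALF_WIDTH
VALUES = [item_information(t, A, B) for t in (-1.0, 0.0, 1.0)]

if __name__ == "__main__":
    print(HALF_WIDTH, VALUES)
```

Each evaluated point returns four times the squared discrimination parameter, up to rounding error.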

Thus two modification formulae of the test information function $I(\theta)$ have been proposed in the
present research in order to provide better measures of local accuracies of the estimation of $\theta$, when
the maximum likelihood estimation is used. They start from the search for a minimum variance bound,
and from a minimum bound of the mean squared error, of any estimator, biased or unbiased.
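[III.1] Minimum Variance Bound

Let $\theta_V^*$ denote any estimator of $\theta$. We can write in general

(3.3) $E(\theta_V^* \mid \theta) = \theta + E[(\theta_V^* - \theta) \mid \theta]$ .

When the item responses are discrete, we have

(3.4) $E(\theta_V^* \mid \theta) = \sum_V \theta_V^*\,P_V(\theta) = \sum_V \theta_V^*\,L_V(\theta)$ ,

where $L_V(\theta)$ denotes the likelihood function. Differentiating both sides of (3.4) with respect to $\theta$,
we obtain

(3.5) $\frac{\partial}{\partial\theta}E(\theta_V^* \mid \theta) = \sum_V \theta_V^*\left[\frac{\partial}{\partial\theta}P_V(\theta)\right] = \sum_V \theta_V^*\left[\frac{\partial}{\partial\theta}L_V(\theta)\right]$ .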

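We can write

(3.6) $\frac{\partial}{\partial\theta}L_V(\theta) = \left[\frac{\partial}{\partial\theta}\log L_V(\theta)\right] L_V(\theta)$ ,

and using this we can rewrite (3.5) into the form

(3.7) $\frac{\partial}{\partial\theta}E(\theta_V^* \mid \theta) = \sum_V \theta_V^*\left[\frac{\partial}{\partial\theta}\log L_V(\theta)\right] L_V(\theta)$ .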



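From this result, by the Cramer-Rao inequality, we obtain

(3.8) $\left[\frac{\partial}{\partial\theta}E(\theta_V^* \mid \theta)\right]^2 \le \mathrm{Var.}(\theta_V^* \mid \theta)\;E\left[\left\{\frac{\partial}{\partial\theta}\log L_V(\theta)\right\}^2 \mid \theta\right]$ .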



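Since we can write

(3.9) $E\left[\left\{\frac{\partial}{\partial\theta}\log L_V(\theta)\right\}^2 \mid \theta\right] = -E\left[\frac{\partial^2}{\partial\theta^2}\log L_V(\theta) \mid \theta\right]$ ,

from this, (2.7), (2.8) and (3.3) we can rewrite and rearrange the inequality (3.8) into the form

(3.10) $\mathrm{Var.}(\theta_V^* \mid \theta) \ge \left[\frac{\partial}{\partial\theta}E(\theta_V^* \mid \theta)\right]^2 [I(\theta)]^{-1} = \left[1 + \frac{\partial}{\partial\theta}E(\theta_V^* - \theta \mid \theta)\right]^2 [I(\theta)]^{-1}$ ,

whose rightmost side provides us with the minimum variance bound of the conditional distribution
of any estimator $\theta_V^*$. When $\theta_V^*$ is biased, the size of the minimum variance bound is determined by
the second term of the first factor of the minimum bound, and the result can be greater or less than
the reciprocal of the test information function depending upon the sign of this partial derivative.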
[III.2] First Modified Test Information Function

Lord has proposed a bias function for the maximum likelihood estimate of $\theta$ in the three-parameter
logistic model whose operating characteristic of the correct answer, $P_g(\theta)$, is given by

(3.11) $P_g(\theta) = c_g + (1 - c_g)[1 + \exp\{-Da_g(\theta - b_g)\}]^{-1}$ ,

where $a_g$, $b_g$, and $c_g$ are the item discrimination, difficulty, and guessing parameters, and $D$ is a
scaling factor, which is set equal to 1.7 when the logistic model is used as a substitute for the normal
ogive model. Lord's bias function $B(\hat\theta_V \mid \theta)$ can be written as

(3.12) $B(\hat\theta_V \mid \theta) = D\,[I(\theta)]^{-2}\sum_{g=1}^{n} a_g\,I_g(\theta)\left[\psi_g(\theta) - \frac{1}{2}\right]$ ,

where

(3.13) $\psi_g(\theta) = [1 + \exp\{-Da_g(\theta - b_g)\}]^{-1}$

(cf. Lord, 1983). We can see in the above formula of the MLE bias function that the bias should be
negative when $\psi_g(\theta)$ is less than 0.5 for all the items, which is necessarily the case for lower values of
$\theta$, and should be positive when $\psi_g(\theta)$ is greater than 0.5 for all the items, i.e., for higher values of
$\theta$, and in between the bias tends to be close to zero, for the last factor in the formula assumes negative
values for some items and positive values for some others, provided that the difficulty parameter $b_g$
distributes widely.
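The sign behaviour described above can be checked with a short sketch of Lord's bias function for the three-parameter logistic model, formulae (3.11) through (3.13), with item information taken from (2.4). The three items and all function names are hypothetical.

```python
import math

def psi(theta, a, b, scale=1.7):
    """Logistic function of formula (3.13)."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))

def item_info_3pl(theta, a, b, c, scale=1.7):
    """Item information (2.4) for the three-parameter logistic model (3.11)."""
    s = psi(theta, a, b, scale)
    p = c + (1.0 - c) * s
    dp = scale * a * (1.0 - c) * s * (1.0 - s)
    return dp * dp / (p * (1.0 - p))

def lord_bias(theta, items, scale=1.7):
    """Lord's MLE bias function, formula (3.12); items is a list of (a, b, c) triples."""
    total = sum(item_info_3pl(theta, a, b, c, scale) for a, b, c in items)
    weighted = sum(
        a * item_info_3pl(theta, a, b, c, scale) * (psi(theta, a, b, scale) - 0.5)
        for a, b, c in items
    )
    return scale * weighted / total ** 2

# Hypothetical three-item test with widely spread difficulty parameters.
ITEMS_3PL = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]

if __name__ == "__main__":
    for t in (-3.0, 0.0, 3.0):
        print(t, lord_bias(t, ITEMS_3PL))
```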

In the general case of discrete item responses, we obtain for the bias function of the maximum
likelihood estimate (cf. Samejima, 1987)

(3.14) $B(\hat\theta_V \mid \theta) = E[\hat\theta_V - \theta \mid \theta] = -(1/2)\,[I(\theta)]^{-2}\sum_{g=1}^{n}\sum_{k_g} A_{k_g}(\theta)\,P''_{k_g}(\theta)$
$= -(1/2)\,[I(\theta)]^{-2}\sum_{g=1}^{n}\sum_{k_g} P'_{k_g}(\theta)\,P''_{k_g}(\theta)\,[P_{k_g}(\theta)]^{-1}$ ,

where $A_{k_g}(\theta)$ is the basic function for the discrete item response $k_g$, and $P'_{k_g}(\theta)$ and $P''_{k_g}(\theta)$ denote
the first and second partial derivatives of $P_{k_g}(\theta)$ with respect to $\theta$, respectively. On the graded
response level where item score $x_g$ assumes successive integers, 0 through $m_g$, each $k_g$ in the
above formula must be replaced by the graded item score $x_g$ (cf. Samejima, 1969, 1972). On the
dichotomous response level, it can be reduced to the form

(3.15) $B(\hat\theta_V \mid \theta) = E[\hat\theta_V - \theta \mid \theta] = (-1/2)\,[I(\theta)]^{-2}\sum_{g=1}^{n} I_g(\theta)\,P''_g(\theta)\,[P'_g(\theta)]^{-1}$ ,

with $P'_g(\theta)$ and $P''_g(\theta)$ indicating the first and second partial derivatives of $P_g(\theta)$ with respect to
$\theta$, respectively. This formula includes Lord's bias function in the three-parameter logistic model as a
special case.
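For dichotomous items with an arbitrary operating characteristic, the bias function (3.15) can be approximated numerically. The following is a sketch under the assumption that central finite differences are adequate for the first and second derivatives; the logistic items used to exercise it and all names are hypothetical.

```python
import math

def derivatives(p_func, theta, h=1e-3):
    """Value, first and second derivatives of p_func at theta by central differences."""
    p0 = p_func(theta)
    up, down = p_func(theta + h), p_func(theta - h)
    return p0, (up - down) / (2.0 * h), (up - 2.0 * p0 + down) / h ** 2

def mle_bias(p_funcs, theta):
    """MLE bias function for dichotomous items, formula (3.15)."""
    total_info = 0.0
    acc = 0.0
    for f in p_funcs:
        p0, p1, p2 = derivatives(f, theta)
        info = p1 * p1 / (p0 * (1.0 - p0))  # item information, formula (2.4)
        total_info += info
        acc += info * p2 / p1
    return -0.5 * acc / total_info ** 2

def logistic_item(a, b, scale=1.7):
    """Hypothetical logistic operating characteristic with no guessing."""
    return lambda theta: 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))

# Hypothetical test whose difficulty parameters are symmetric about zero.
SYMMETRIC_ITEMS = [logistic_item(1.0, b) for b in (-1.0, 0.0, 1.0)]

if __name__ == "__main__":
    for t in (-2.0, 0.0, 2.0):
        print(t, mle_bias(SYMMETRIC_ITEMS, t))
```

With difficulties placed symmetrically about zero the item contributions cancel at the centre, so the computed bias there is close to zero, while it is negative below and positive above.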

We can rewrite the inequality (3.10) for the maximum likelihood estimate $\hat\theta_V$

(3.16) $\mathrm{Var.}(\hat\theta_V \mid \theta) \ge \left[1 + \frac{\partial}{\partial\theta}B(\hat\theta_V \mid \theta)\right]^2 [I(\theta)]^{-1}$ .

Taking the reciprocal of the right hand side of (3.16), which is an approximate minimum variance bound
of the maximum likelihood estimator, a modified test information function, $\Upsilon(\theta)$, is proposed by

(3.17) $\Upsilon(\theta) = I(\theta)\left[1 + \frac{\partial}{\partial\theta}B(\hat\theta_V \mid \theta)\right]^{-2}$ .

From this formula, we can see that the relationship between this new function and the original test
information function depends upon the first derivative of the MLE bias function. If the derivative is
positive, then the new function will assume a lesser value than the original test information function;
if it is negative, then this relationship will be reversed; if it is zero, i.e., if the MLE is unbiased, then
these two functions assume the same value. We can write from (3.14) for the general form of the
derivative of the MLE bias function

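(3.18) $\frac{\partial}{\partial\theta}B(\hat\theta_V \mid \theta) = \{I(\theta)\}^{-1}\Big[(1/2)\{I(\theta)\}^{-1}\sum_{g=1}^{n}\sum_{k_g}\big(\{P'_{k_g}(\theta)\}^2 P''_{k_g}(\theta)\{P_{k_g}(\theta)\}^{-2} - [\{P''_{k_g}(\theta)\}^2 + P'_{k_g}(\theta)P'''_{k_g}(\theta)]\{P_{k_g}(\theta)\}^{-1}\big) - 2\,B(\hat\theta_V \mid \theta)\,I'(\theta)\Big]$ ,

where $P'''_{k_g}(\theta)$ and $I'(\theta)$ denote the third and the first derivatives of $P_{k_g}(\theta)$ and $I(\theta)$ with respect
to $\theta$, respectively. It is obvious from (2.3) and (2.8) that we have

(3.19) $I'_g(\theta) = \sum_{k_g} P'_{k_g}(\theta)\,[P''_{k_g}(\theta)\{P_{k_g}(\theta)\}^{-1} - I_{k_g}(\theta)]$

and

(3.20) $I'(\theta) = \sum_{g=1}^{n} I'_g(\theta) = \sum_{g=1}^{n}\sum_{k_g} P'_{k_g}(\theta)\,[P''_{k_g}(\theta)\{P_{k_g}(\theta)\}^{-1} - I_{k_g}(\theta)]$ ,

where $I'_g(\theta)$ is the first derivative of the item information function $I_g(\theta)$ with respect to $\theta$. For a
set of dichotomous items (3.18) becomes simplified into the form

(3.21) $\frac{\partial}{\partial\theta}B(\hat\theta_V \mid \theta) = \{I(\theta)\}^{-1}\Big[(1/2)\{I(\theta)\}^{-1}\sum_{g=1}^{n}\{P_g(\theta)\}^{-2}\{1 - P_g(\theta)\}^{-2}\big(\{1 - 2P_g(\theta)\}\{P'_g(\theta)\}^2 P''_g(\theta) - P_g(\theta)\{1 - P_g(\theta)\}(\{P''_g(\theta)\}^2 + P'_g(\theta)P'''_g(\theta))\big) - 2\,B(\hat\theta_V \mid \theta)\,I'(\theta)\Big]$ ,

where $B(\hat\theta_V \mid \theta)$ is given by (3.15).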

[III.3] Minimum Bound of the Mean Squared Error 

When the estimator $\theta_V^*$ is conditionally biased, however small the conditional variance may be, it
does not reflect the accuracy of estimation of $\theta$. Thus the mean squared error, $E[(\theta_V^* - \theta)^2 \mid \theta]$,
becomes a more important indicator of the accuracy. We can write for the mean squared error




(3.22) $E[(\theta_V^* - \theta)^2 \mid \theta] = \mathrm{Var.}(\theta_V^* \mid \theta) + [E(\theta_V^* \mid \theta) - \theta]^2$

(cf. Kendall and Stuart, 1961). We can see in this formula that the mean squared error equals the
conditional variance if $\theta_V^*$ is unbiased, and is greater than the variance when $\theta_V^*$ is biased. From this
and the inequality (3.10) we obtain for the minimum bound of the mean squared error

(3.23) $E[(\theta_V^* - \theta)^2 \mid \theta] \ge \left[1 + \frac{\partial}{\partial\theta}E(\theta_V^* - \theta \mid \theta)\right]^2 [I(\theta)]^{-1} + [E(\theta_V^* \mid \theta) - \theta]^2$ .

Note that this inequality holds for any estimator, $\theta_V^*$, of $\theta$.

[III.4] Second Modified Test Information Function

For the maximum likelihood estimate $\hat\theta_V$, we can rewrite the inequality (3.23) by using the MLE
bias function, which is given by (3.14), to obtain

(3.24) $E[(\hat\theta_V - \theta)^2 \mid \theta] \geq [1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^2\,[I(\theta)]^{-1} + [B(\hat\theta_V \mid \theta)]^2$ .

Taking the reciprocal of the right hand side of (3.24), which is an approximate minimum bound of
the mean squared error of the maximum likelihood estimator, the second modified test information
function, $\Xi(\theta)$, is proposed by

(3.25) $\Xi(\theta) = I(\theta)\,\{[1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^2 + I(\theta)\,[B(\hat\theta_V \mid \theta)]^2\}^{-1}$ .

We can see that the difference between the two modification formulae of the test information function,
which are defined by (3.17) and (3.25), respectively, is the second and last term in the braces of the
right hand side of the formula (3.25). Since this term is nonnegative, there is a relationship

(3.26) $\Xi(\theta) \leq T(\theta)$ ,

throughout the whole range of $\theta$, regardless of the slope of the MLE bias function. If there is a range
of $\theta$ where the maximum likelihood estimate is unbiased, then we will have for that range of $\theta$

(3.27) $\Xi(\theta) = T(\theta) = I(\theta)$ .

Since under a general condition the maximum likelihood estimator $\hat\theta_V$ is asymptotically unbiased as
the number of items approaches positive infinity, (3.27) holds asymptotically for all $\theta$.
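
The pointwise evaluation of the two modification formulae can be sketched in a few lines of Python. This is only an illustration: the function name and the numerical inputs are hypothetical, and the first modification is assumed to have the form $I(\theta)[1 + \partial B/\partial\theta]^{-2}$, which is what (3.25) reduces to when the last term in its braces is dropped.

```python
def modified_information(info, bias, bias_slope):
    """Return (T, Xi), the two modified test information values at one ability level.

    info       -- test information I(theta); must be positive
    bias       -- MLE bias B(theta_hat | theta)
    bias_slope -- derivative of the MLE bias with respect to theta
    """
    if info <= 0.0:
        raise ValueError("test information must be positive")
    slope_term = (1.0 + bias_slope) ** 2
    first = info / slope_term                         # assumed form of the first modification
    second = info / (slope_term + info * bias ** 2)   # formula (3.25)
    return first, second


# Hypothetical values at a single ability level, for illustration only.
t_value, xi_value = modified_information(info=20.0, bias=0.05, bias_slope=0.10)
print(t_value, xi_value)
```

With a positive bias slope both values fall below the original information, and the second never exceeds the first, in line with (3.26).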

[III.5] Examples

Samejima has applied formula (3.15) for the MLE bias functions of the Iowa Level 11 Vocabulary
Subtest and Shiba's Test J1 of Word/Phrase Comprehension, based upon the set of data collected for
2,356 and 2,259 subjects, respectively. These tests have forty-three and fifty-five dichotomously
scored items, respectively, and following the normal ogive model, whose operating characteristic for the
correct answer is given by

(3.28) $P_g(\theta) = (2\pi)^{-1/2} \int_{-\infty}^{a_g(\theta - b_g)} \exp[-u^2/2]\,du$



or on the 



it b^fro^s n/^Trnti.!* - * " m °r <0Be " " th °" " Figure 3-1, 

between. TT,e same applies to eV) .^h^l^X^ " " ^ * 

(3.29) $\Xi(\theta) < T(\theta) < I(\theta)$ ,

throughout the whole range of $\theta$.

In the normal ogive model, differentiating (3.28) twice with respect to $\theta$ and rearranging, we obtain

(3.30) $P'_g(\theta) = (2\pi)^{-1/2}\,a_g \exp[-(1/2)\,a_g^2(\theta - b_g)^2]$

and

(3.31) $P''_g(\theta) = -a_g^2(\theta - b_g)\,P'_g(\theta)$ .



Substituting (3.30) and (3.31) into (3.15) and rearranging, we can write for the MLE bias function
following the normal ogive model on the dichotomous response level

(3.32) $B(\hat\theta_V \mid \theta) = (1/2)\,[I(\theta)]^{-2} \sum_{g=1}^{n} a_g^2(\theta - b_g)\,I_g(\theta)$ .

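Differentiating (3.32) with respect to $\theta$, we obtain

(3.33) $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = [I(\theta)]^{-2} \big\{ (1/2) \sum_{g=1}^{n} a_g^2\,[I'_g(\theta)(\theta - b_g) + I_g(\theta)] - 2 B(\hat\theta_V \mid \theta)\,I(\theta)\,I'(\theta) \big\}$ .

It is obvious from (2.4), (2.8) and (3.31) that we have

(3.34) $I'_g(\theta) = I_g(\theta)\,\big[ P'_g(\theta)\{2P_g(\theta) - 1\}(P_g(\theta)\{1 - P_g(\theta)\})^{-1} - 2a_g^2(\theta - b_g) \big]$

and

(3.35) $I'(\theta) = \sum_{g=1}^{n} I_g(\theta)\,\big[ P'_g(\theta)\{2P_g(\theta) - 1\}(P_g(\theta)\{1 - P_g(\theta)\})^{-1} - 2a_g^2(\theta - b_g) \big]$ .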

Figure 3-2 shows the square roots of the original and the two modified test information functions
for the Iowa Level 11 Vocabulary Subtest and for Shiba's Test J1 of Word/Phrase Comprehension,
following the normal ogive model. In each of these figures, the curves representing the results of the
two modification formulae assume lower values than the square root of the original test information
function for all $\theta$, as was expected from the shape of the MLE bias function in Figure 3-1. The
discrepancies between the results of the two modification formulae are small, however, in each figure.

In the three-parameter logistic model, the operating characteristic of the correct answer is given by
the formula (3.11), and Lord's MLE bias function for the three-parameter logistic model, which is given
by (3.12), is readily applicable. Differentiating (3.11) three times with respect to $\theta$ and rearranging,
we can write

(3.36) $P'_g(\theta) = (1 - c_g)\,D a_g\,\psi_g(\theta)\,[1 - \psi_g(\theta)]$ ,

(3.37) $P''_g(\theta) = (1 - c_g)\,D^2 a_g^2\,\psi_g(\theta)\,[1 - \psi_g(\theta)]\,[1 - 2\psi_g(\theta)] = D a_g\,P'_g(\theta)\,[1 - 2\psi_g(\theta)]$

and

(3.38) $P'''_g(\theta) = D^2 a_g^2\,P'_g(\theta)\,[1 - 6\psi_g(\theta) + 6\{\psi_g(\theta)\}^2]$ ,

FIGURE 3-2. Square Roots of the Original (Solid Line) and the Two Modified (Dashed and Dotted Lines)
Test Information Functions of the Iowa Level 11 Vocabulary Subtest, and Those of Shiba's
Test J1 of Word/Phrase Comprehension, Following the Normal Ogive Model.

where $\psi_g(\theta)$ is defined by (3.13). Substituting (3.36) into (2.4) and rearranging, we obtain for the
item information function

(3.39) $I_g(\theta) = (1 - c_g)\,D^2 a_g^2\,\{\psi_g(\theta)\}^2\,[1 - \psi_g(\theta)]\,[c_g + (1 - c_g)\,\psi_g(\theta)]^{-1}$ .

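This and (2.8) will enable us to evaluate Lord's MLE bias function given by (3.12). Differentiating
(3.12) with respect to $\theta$ and rearranging, we can write

(3.40)
$$
\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = D\,[I(\theta)]^{-2} \Big[ \sum_{g=1}^{n} a_g \big( I'_g(\theta)\{\psi_g(\theta) - (1/2)\} + D a_g\,I_g(\theta)\,\psi_g(\theta)\{1 - \psi_g(\theta)\} \big)
- 2\,I'(\theta)\,\{I(\theta)\}^{-1} \sum_{g=1}^{n} a_g\,I_g(\theta)\,\{\psi_g(\theta) - (1/2)\} \Big] .
$$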

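We also obtain from (2.4), (3.11) and (2.8) the first derivatives of the item and the test information
functions with respect to $\theta$ so that we have

(3.41)
$$
I'_g(\theta) = (1 - c_g)\,D^3 a_g^3\,\{\psi_g(\theta)\}^2\,[1 - \psi_g(\theta)]\,\{P_g(\theta)\}^{-1}
\big[ 2 - 3\psi_g(\theta) - (1 - c_g)\,\psi_g(\theta)\{1 - \psi_g(\theta)\}\{P_g(\theta)\}^{-1} \big]
= D a_g\,I_g(\theta)\,\big[ 2\{1 - \psi_g(\theta)\} - \psi_g(\theta)\{P_g(\theta)\}^{-1} \big]
$$

and

(3.42) $I'(\theta) = D \sum_{g=1}^{n} a_g\,I_g(\theta)\,\big[ 2\{1 - \psi_g(\theta)\} - \psi_g(\theta)\{P_g(\theta)\}^{-1} \big]$ ,

and we can use these two results in (3.40) in order to evaluate $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)$.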
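
A closed-form derivative of this kind is easy to check against a finite difference. The sketch below does so for the slope of the item information function in the three-parameter logistic model; the item parameters are hypothetical, the operating characteristic is taken to be $c_g + (1 - c_g)\psi_g(\theta)$ with $\psi_g$ the logistic function, and the closed form used is $D a_g I_g(\theta)[2\{1 - \psi_g(\theta)\} - \psi_g(\theta)\{P_g(\theta)\}^{-1}]$.

```python
import math

D = 1.7  # scaling factor


def psi(theta, a, b):
    """Logistic function [1 + exp{-D a (theta - b)}]^(-1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def prob_correct(theta, a, b, c):
    """Three-parameter logistic operating characteristic c + (1 - c) psi."""
    return c + (1.0 - c) * psi(theta, a, b)


def item_information(theta, a, b, c):
    """Item information (P')^2 / {P (1 - P)}, written in terms of psi."""
    s = psi(theta, a, b)
    return (1.0 - c) * D**2 * a**2 * s**2 * (1.0 - s) / prob_correct(theta, a, b, c)


def item_information_slope(theta, a, b, c):
    """Closed form D a I_g [2 (1 - psi) - psi / P]."""
    s = psi(theta, a, b)
    p = prob_correct(theta, a, b, c)
    return D * a * item_information(theta, a, b, c) * (2.0 * (1.0 - s) - s / p)


def central_difference(func, x, h=1e-5):
    """Two-sided finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


# Hypothetical item parameters, chosen only for the check.
A_G, B_G, C_G = 1.2, -0.5, 0.2
THETA = 0.3
numeric_slope = central_difference(lambda t: item_information(t, A_G, B_G, C_G), THETA)
closed_slope = item_information_slope(THETA, A_G, B_G, C_G)
print(numeric_slope, closed_slope)
```

Setting `c` to zero gives the original logistic model, for which the bracket reduces to $1 - 2\psi_g(\theta)$.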

When $c_g = 0$, i.e., for the original logistic model on the dichotomous response level, these formulae
become much more simplified, and we can write

(3.43) $P_g(\theta) = [1 + \exp\{-D a_g(\theta - b_g)\}]^{-1} = \psi_g(\theta)$ ,

(3.44) $P'_g(\theta) = D a_g\,\psi_g(\theta)\,[1 - \psi_g(\theta)]$ ,

(3.45) $P''_g(\theta) = D^2 a_g^2\,\psi_g(\theta)\,[1 - \psi_g(\theta)]\,[1 - 2\psi_g(\theta)] = D a_g\,P'_g(\theta)\,[1 - 2\psi_g(\theta)]$ ,

(3.46) $P'''_g(\theta) = D^3 a_g^3\,\psi_g(\theta)\,[1 - \psi_g(\theta)]\,[1 - 6\psi_g(\theta) + 6\{\psi_g(\theta)\}^2]$ ,

(3.47) $I_g(\theta) = D^2 a_g^2\,\psi_g(\theta)\,[1 - \psi_g(\theta)]$ ,

FIGURE 3-3. MLE Bias Functions of the Hypothetical Test of Thirty-Five Graded Test Items Following
the Normal Ogive Model (Solid Line) and the Logistic Model (Dashed Line).

(3.48) $I'_g(\theta) = D^3 a_g^3\,\psi_g(\theta)\,[1 - \psi_g(\theta)]\,[1 - 2\psi_g(\theta)] = D a_g\,I_g(\theta)\,[1 - 2\psi_g(\theta)]$ ,

and

(3.50) $I'(\theta) = D \sum_{g=1}^{n} a_g\,I_g(\theta)\,[1 - 2\psi_g(\theta)]$ ,

which can be used in evaluating the two modified test information functions, $T(\theta)$ and $\Xi(\theta)$, which
are defined by (3.17) and (3.25), respectively, both for the original logistic model and for the
three-parameter logistic model.

The reader is directed to ONR/RR-90-1 (cf. Samejima, 1990) for the MLE bias function, and
the square roots of the original and the two modified test information functions, of the Iowa Level 11
Vocabulary Subtest and of Shiba's Test J1 of Word/Phrase Comprehension, following the logistic model
with the scaling factor D set equal to 1.7. These results are similar
to those following the normal ogive model, which are presented in Figures 3-1 and 3-2, except that
the square roots of the original and the modified test information functions are a little steeper, a
characteristic of the logistic model in comparison with the normal ogive model.

In the homogeneous case of the graded response level (Samejima, 1969, 1972), the general formula
for the operating characteristic of the item score $x_g$ $(= 0, 1, \ldots, m_g)$ is given by

(3.51) $P_{x_g}(\theta) = P^*_{x_g}(\theta) - P^*_{x_g+1}(\theta)$ ,

where

(3.52) $P^*_{x_g}(\theta) = \int_{-\infty}^{a_g(\theta - b_{x_g})} \psi_g(t)\,dt$ ,

(3.53) $-\infty = b_0 < b_1 < b_2 < \ldots < b_{m_g} < b_{m_g+1} = \infty$ ,

and $\psi_g(t)$ is some specified density function. When we replace the right hand side of (3.52) by that of
(3.28) with $b_g$ replaced by $b_{x_g}$, and use the result in (3.51), we have the operating characteristic of
$x_g$ in the normal ogive model on the graded response level; when we do the same thing using the right
hand side of (3.13), we obtain the operating characteristic of $x_g$ in the logistic model on the graded
response level.
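
The construction of graded-score operating characteristics as differences of adjacent cumulative curves can be sketched as follows. The function names, the discrimination parameter and the two thresholds are hypothetical; the cumulative curve for the lowest boundary is set to one and the curve beyond the highest boundary to zero, which is what $b_0 = -\infty$ and $b_{m_g+1} = \infty$ imply.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function (normal ogive)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(z, scaling=1.7):
    """Logistic function with scaling factor D applied to z = a (theta - b)."""
    return 1.0 / (1.0 + math.exp(-scaling * z))


def category_probabilities(theta, a, thresholds, cdf=normal_cdf):
    """Operating characteristics of the scores 0..m for one graded item.

    thresholds -- strictly increasing difficulty parameters b_1 < ... < b_m
    """
    if any(low >= high for low, high in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative curves: score >= 0 has probability 1, score >= m + 1 has probability 0.
    cumulative = [1.0] + [cdf(a * (theta - b)) for b in thresholds] + [0.0]
    return [cumulative[x] - cumulative[x + 1] for x in range(len(thresholds) + 1)]


# Hypothetical three-category item evaluated at theta = 0.
ogive_probs = category_probabilities(0.0, 1.0, [-1.0, 1.0], cdf=normal_cdf)
logistic_probs = category_probabilities(0.0, 1.0, [-1.0, 1.0], cdf=logistic_cdf)
print(ogive_probs, logistic_probs)
```

Because each category probability is a difference of two monotone curves with ordered thresholds, the values are nonnegative and sum to one at every ability level.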

A hypothetical test of thirty-five graded items, with three graded score categories each, which gives
an approximately constant amount of test information for the interval of $\theta$, (-3, 3), has been used
repeatedly in the author's research (cf. Samejima, 1981, 1988). Figure 3-3 presents the MLE bias
functions for this hypothetical test, following the normal ogive model and the logistic model on the
graded response level, respectively. We can see that a practical unbiasedness holds for a very wide
range of $\theta$ in both cases, as is expected for a set of graded test items whose response difficulty levels
are widely distributed, an advantage of graded responses over dichotomous responses. We also notice
that these two MLE bias functions are almost indistinguishable from each other. Figure 3-4 presents the
square roots of the original and the two modified test information functions of this hypothetical test of
graded items, following the normal ogive model and the logistic model. As is expected, the differences
among the three functions are small for a wide range of $\theta$ in both cases. It is interesting to note,
however, that in these figures the square roots of the modified test information functions assume higher
values than the square root of the original test information function at certain points of $\theta$, and this
tendency is especially conspicuous in the results of the logistic model. This comes from the fact that
the MLE bias functions, which are presented in Figure 3-3 for both models, have tiny ups and downs,
and they are not strictly increasing in $\theta$.

In each of the examples given above, the difficulty parameters of the items in each test are distributed
widely over the range of $\theta$ of interest, and this fact is the main reason that the MLE bias function
assumes relatively small values for a wide range of $\theta$. We also notice that the resulting two modified
test information functions are reasonably close to the original test information function.

For the sake of comparison, Figure 3-5 presents the MLE bias function and the square roots of the
original and the two modified test information functions, for a hypothetical test of thirty equivalent,
dichotomous items with the common item parameters, $a_g = 1.0$ and $b_g = 0.0$, following the logistic
model. We can see in the first graph of Figure 3-5 that the amount of bias increases rapidly outside
the range of $\theta$, (-1.0, 1.0). The resulting square roots of the two modified test information functions
demonstrate substantially large decrements from the original $[I(\theta)]^{1/2}$ outside this interval of $\theta$, as
we can see in the second graph of Figure 3-5.

We also notice that in all these examples there are not substantial differences between the results
of the two modification formulae. This indicates that in these examples it makes little
difference whether we choose Modification Formula No. 1 or Modification Formula No. 2. We should not
generalise this conclusion to other situations, however, until we have tried these modification formulae
on different types of data sets.

FIGURE 3-4. Square Roots of the Original (Solid Line) and the Two Modified (Dashed and Dotted Lines)
Test Information Functions of the Hypothetical Test of Thirty-Five Graded Test Items
Following the Normal Ogive Model and the Logistic Model, Respectively. (Horizontal axis: THETA,
-5.0 to 5.0.)

FIGURE 3-5. MLE Bias Function of the Hypothetical Test of Thirty Equivalent Test Items Following
the Logistic Model with $a_g = 1.0$ and $b_g = 0.0$ As the Common Parameters (Above),
and Square Roots of the Original (Solid Line) and the Two Modified (Dashed and
Dotted Lines) Test Information Functions of the Same Test (Below). (Horizontal axis: THETA,
-5.0 to 5.0.)

[III.6] Minimum Bounds of Variance and Mean Squared Error for the 
Transformed Latent Variable 

Since most psychological scales, including those in latent trait models, are subject to monotone
transformation, we need to consider information functions that are based upon the transformed latent
variable. Let $\tau$ denote a transformed latent variable, i.e.,

(3.54) $\tau = \tau(\theta)$ .

We assume that $\tau$ is strictly increasing in, and three times differentiable with respect to, $\theta$, and vice
versa. We have for the operating characteristic, $P^*_{k_g}(\tau)$, of the discrete item response $k_g$, which is
defined as a function of $\tau$,

(3.55) $P^*_{k_g}(\tau) = prob.[k_g \mid \tau] = prob.[k_g \mid \theta] = P_{k_g}(\theta)$ ,

and by local independence we can write for the operating characteristic of the response pattern, $P^*_V(\tau)$,

(3.56) $P^*_V(\tau) = \prod_{k_g \in V} P^*_{k_g}(\tau) = \prod_{k_g \in V} P_{k_g}(\theta) = P_V(\theta)$ .

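As before, the item response information function, $I^*_{k_g}(\tau)$, is defined by

(3.57) $I^*_{k_g}(\tau) = -\frac{\partial^2}{\partial\tau^2} \log P^*_{k_g}(\tau)$ ,

and for the item information function, $I^*_g(\tau)$, and the test information function, $I^*(\tau)$, we can write
from (3.57), (2.3) and (2.8)

(3.58)
$$
I^*_g(\tau) = E[I^*_{k_g}(\tau) \mid \tau] = \sum_{k_g} I^*_{k_g}(\tau)\,P^*_{k_g}(\tau)
= \sum_{k_g} \big[\tfrac{\partial}{\partial\tau} P^*_{k_g}(\tau)\big]^2 [P^*_{k_g}(\tau)]^{-1} = I_g(\theta)\,\big[\tfrac{\partial\theta}{\partial\tau}\big]^2
$$

and

(3.59) $I^*(\tau) = \sum_{g=1}^{n} I^*_g(\tau) = I(\theta)\,[\frac{\partial\theta}{\partial\tau}]^2$ ,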

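respectively. Let $\tau^*_V$ be any estimator of $\tau$, which may be biased or unbiased. In general we can
write

(3.60) $E(\tau^*_V \mid \tau) = \tau + E(\tau^*_V - \tau \mid \tau)$ ,

and, differentiating (3.60) with respect to $\theta$, we obtain

(3.61) $\frac{\partial}{\partial\theta} E(\tau^*_V \mid \tau) = \frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau^*_V - \tau \mid \tau)$ .

Since from (3.56) we can also write for $E(\tau^*_V \mid \tau)$

(3.62) $E(\tau^*_V \mid \tau) = \sum_V \tau^*_V\,P^*_V(\tau) = \sum_V \tau^*_V\,P_V(\theta)$ ,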

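differentiating (3.62) with respect to $\theta$ and following a logic similar to that used in Section 3.1, we
obtain

(3.63)
$$
\frac{\partial}{\partial\theta} E(\tau^*_V \mid \tau) = \sum_V \tau^*_V\,\frac{\partial}{\partial\theta} P_V(\theta) = \sum_V [\tau^*_V - E(\tau^*_V \mid \tau)]\,\frac{\partial}{\partial\theta} P_V(\theta)
= \sum_V [\tau^*_V - E(\tau^*_V \mid \tau)]\,[\tfrac{\partial}{\partial\theta} \log P_V(\theta)]\,P_V(\theta) .
$$

By the Cramér-Rao inequality, we can write

(3.64) $[\frac{\partial}{\partial\theta} E(\tau^*_V \mid \tau)]^2 \leq Var.(\tau^*_V \mid \tau)\; E[\{\frac{\partial}{\partial\theta} \log P_V(\theta)\}^2]$ ,

and from this, (2.7), (2.8), (3.10) and (3.61) we obtain

(3.65) $Var.(\tau^*_V \mid \tau) \geq [\frac{\partial}{\partial\theta} E(\tau^*_V \mid \tau)]^2\,[I(\theta)]^{-1} = [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau^*_V - \tau \mid \tau)]^2\,[I(\theta)]^{-1}$ .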

Thus the rightmost side of (3.65) provides us with the minimum variance bound of any estimator of
$\tau$. When $\tau^*_V$ is an unbiased estimator of $\tau$, the second term of the first factor of the rightmost
side of (3.65) equals zero, and by virtue of (3.59) the inequality is reduced to

(3.66) $Var.(\tau^*_V \mid \tau) \geq [\frac{\partial\tau}{\partial\theta}]^2\,[I(\theta)]^{-1} = [I^*(\tau)]^{-1}$ .

For the mean squared error, $E[(\tau^*_V - \tau)^2 \mid \tau]$, we can write

(3.67) $E[(\tau^*_V - \tau)^2 \mid \tau] = Var.(\tau^*_V \mid \tau) + [E(\tau^*_V \mid \tau) - \tau]^2$ ,

and from this and (3.65) we obtain

(3.68) $E[(\tau^*_V - \tau)^2 \mid \tau] \geq [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau^*_V - \tau \mid \tau)]^2\,[I(\theta)]^{-1} + [E(\tau^*_V \mid \tau) - \tau]^2$ .
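
The way information rescales under a monotone transformation can be checked numerically for a single dichotomous item. The sketch below uses a hypothetical transformation $\tau = \exp(\theta)$ and the standard identity that item information equals the sum over responses of the squared slope of the operating characteristic divided by the operating characteristic; it compares the value computed directly on the $\tau$ scale with $I_g(\theta)[\partial\theta/\partial\tau]^2$. The item parameters and function names are illustrative only.

```python
import math

D = 1.7  # scaling factor


def prob_correct(theta, a=1.0, b=0.0):
    """Logistic operating characteristic of the correct answer."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def item_information_theta(theta, a=1.0, b=0.0):
    p = prob_correct(theta, a, b)
    return D**2 * a**2 * p * (1.0 - p)


def theta_of_tau(tau):
    """Inverse of the hypothetical transformation tau = exp(theta)."""
    return math.log(tau)


def item_information_tau(tau, a=1.0, b=0.0, h=1e-6):
    """Sum over both responses of (dP*/dtau)^2 / P*, by central differences."""
    p = prob_correct(theta_of_tau(tau), a, b)
    slope = (prob_correct(theta_of_tau(tau + h), a, b)
             - prob_correct(theta_of_tau(tau - h), a, b)) / (2.0 * h)
    return slope**2 / p + slope**2 / (1.0 - p)


TAU = 2.0
direct = item_information_tau(TAU)
# For tau = exp(theta), d theta / d tau = 1 / tau.
via_theta = item_information_theta(theta_of_tau(TAU)) * (1.0 / TAU) ** 2
print(direct, via_theta)
```

The two numbers agree up to finite-difference error, so the minimum variance bound for an unbiased estimator of $\tau$ is the bound for $\theta$ multiplied by $[\partial\tau/\partial\theta]^2$.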
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[III.7] Modified Test Information Functions Based upon the Transformed 
Latent Variable 

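It is obvious that the maximum likelihood estimate, $\hat\tau_V$, of $\tau$ can be obtained by the direct
transformation of the maximum likelihood estimate, $\hat\theta_V$, of $\theta$, i.e.,

(3.69) $\hat\tau_V = \tau(\hat\theta_V)$ .

Let $B^*(\hat\tau_V \mid \tau)$ be the MLE bias function defined for the transformed latent variable $\tau$, i.e.,

(3.70) $B^*(\hat\tau_V \mid \tau) = E(\hat\tau_V - \tau \mid \tau)$ .

From this, (3.65) and (3.68) we obtain

(3.71) $Var.(\hat\tau_V \mid \tau) \geq [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^2\,[I(\theta)]^{-1}$

and

(3.72) $E[(\hat\tau_V - \tau)^2 \mid \tau] \geq [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^2\,[I(\theta)]^{-1} + [B^*(\hat\tau_V \mid \tau)]^2$ .

The reciprocals of the right hand sides of the above two inequalities provide us with the two modified
test information functions for the transformed latent variable $\tau$, i.e.,

(3.73) $T^*(\tau) = I(\theta)\,[\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^{-2}$

and

(3.74) $\Xi^*(\tau) = I(\theta)\,[\{\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)\}^2 + I(\theta)\,\{B^*(\hat\tau_V \mid \tau)\}^2]^{-1}$ .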



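In the general case of discrete item responses we can write for the MLE bias function $B^*(\hat\tau_V \mid \tau)$
and its derivative with respect to $\theta$

(3.75) $B^*(\hat\tau_V \mid \tau) = B(\hat\theta_V \mid \theta)\,\frac{\partial\tau}{\partial\theta} + (1/2)\,[I(\theta)]^{-1}\,\frac{\partial^2\tau}{\partial\theta^2}$

and

(3.76)
$$
\frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau) = \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)\,\frac{\partial\tau}{\partial\theta} + B(\hat\theta_V \mid \theta)\,\frac{\partial^2\tau}{\partial\theta^2}
- (1/2)\,[I(\theta)]^{-2}\,I'(\theta)\,\frac{\partial^2\tau}{\partial\theta^2} + (1/2)\,[I(\theta)]^{-1}\,\frac{\partial^3\tau}{\partial\theta^3} ,
$$

respectively (cf. Samejima, 1987). Thus we can use (3.75) and (3.76) in evaluating the modified test
information functions, $T^*(\tau)$ and $\Xi^*(\tau)$, which are given by (3.73) and (3.74).

[III.8] Discussion and Conclusions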

A minimum variance bound of any estimator, biased or unbiased, has been considered, and, based on that,
Modification Formula No. 1 has been proposed for the maximum likelihood estimator, in place of the
test information function. A minimum bound of the mean squared error of any estimator has also been
considered, and, based on that, Modification Formula No. 2 in the same context has been proposed.
Examples have been given. These topics have also been discussed and observed for the monotonically
transformed latent variable.

It is expected that these two modification formulae of the test information function can effectively
be used in order to supplement a relative weakness of the test information function in certain situations.
Results are yet to come.

References 

[1] Kendall, M. G. and Stuart, A. The advanced theory of statistics. Vol. 2. New York: Hafner, 1961. 

[2] Lord, F. M. Unbiased estimators of ability parameters, of their variance, and of their parallel-forms
reliability. Psychometrika, 48, 1983, 233-245.

[3] Samejima, F. Estimation of ability using a response pattern of graded scores. Psychometrika 
Monograph, No. 17, 1969. 

[4] Samejima, F. A general model for free-response data. Psychometrika Monograph, No. 18, 1972. 

[5] Samejima, F. Effects of individual optimisation in setting boundaries of dichotomous items on 
accuracy of estimation. Applied Psychological Measurement, 1, 1977a, 77-94. 

[6] Samejima, F. A use of the information function in tailored testing. Applied Psychological Measure- 
ment, 1, 1977b, 233-247. 

[7] Samejima, F. Constant information model: a new promising item characteristic function. 
ONR/RR-79-1, 1979a. 

[8] Samejima, F. Convergence of the conditional distribution of the maximum likelihood estimate, 
given latent trait, to the asymptotic normality: Observations made through the constant infor- 
mation model. ONR/RR-79-3, 1979b. 

[9] Samejima, F. Final Report: Efficient methods of estimating the operating characteristics of item 
response categories and challenge to a new model for the multiple-choice item. Final Report of 
N00014-77-C-0360, Office of Naval Research, 1981. 

[10] Samejima, F. Plausibility functions of Iowa Vocabulary test items estimated by the Simple Sum
Procedure of the Conditional P.D.F. Approach. ONR/RR-84-1, 1984a.

[11] Samejima, F. Comparison of the estimated item parameters of Shiba's Word/Phrase Comprehension
Tests obtained by Logist 5 and those by the tetrachoric method. ONR/RR-84-2, 1984b.

[12] Samejima, F. Bias function of the maximum likelihood estimate of ability for discrete item re- 
sponses. ONR/RR-87-1, 1987. 

[13] Samejima, F. Final Report: Advancement of latent trait theory. Final Report of N00014-81-C-0569,
Office of Naval Research, 1988.

[14] Samejima, F. Modifications of the test information function. ONR/RR-90-1, 1990. 




IV Reliability Coefficient and Standard Error of Measure- 
ment in Classical Mental Test Theory Predicted in the 
Context of Latent Trait Models 

By virtue of the population-free characteristic of the test information function $I(\theta)$, adding further
information about the ability distribution of the examinees, we can predict the reliability coefficient
and the standard error of measurement in the sense of
classical mental test theory for each and every specified group of examinees who have taken the same
test (cf. Samejima, 1977b, 1987). This is further facilitated by the two modification formulae of the
test information function, which use the MLE bias function (cf. Samejima, 1987, 1990), and have been
introduced in the preceding chapter.

[IV.l] General Case 

Let $\theta^*$ be any estimator of ability $\theta$. We can write

(4.1) $\theta^* = \theta + \varepsilon$ ,

where $\varepsilon$ denotes the error variable. In the test-retest situation, we have

(4.2) $\theta^*_1 = \theta + \varepsilon_1$ , $\theta^*_2 = \theta + \varepsilon_2$ .

If

(4.3) $Cov.(\varepsilon_1, \varepsilon_2) = 0$ ,

(4.4) $Var.(\varepsilon_1) = Var.(\varepsilon_2)$

and

(4.5) $Cov.(\theta, \varepsilon_1) = Cov.(\theta, \varepsilon_2) = 0$ ,

then we will have

(4.6) $Corr.(\theta^*_1, \theta^*_2) = Var.(\theta)\,[Var.(\theta^*_1)]^{-1}$ .

If we replace $\theta$ by the true test score $T$, a transformed form of $\theta$ specific to a given
test, and $\theta^*$ and $\varepsilon$ by the observed test score $X$ and its error of estimation $E$, then (4.1) becomes

(4.7) $X = T + E$ ,

which represents the fundamental assumption in classical mental test theory, and (4.6) becomes a
familiar formula for the reliability coefficient $r_{X_1 X_2}$,

(4.8) $r_{X_1 X_2} = Var.(T)\,\{Var.(X)\}^{-1}$ .

In classical mental test theory, however, researchers seldom check if these assumptions are acceptable.
In fact, in many cases (4.5) is violated if we replace $\theta$ by $T$, and $\varepsilon_1$ and $\varepsilon_2$ by $E_1$ and $E_2$,
respectively, unless the test has been constructed in such a way that most individuals from the target
population have mediocre true scores.

We can write in general

(4.9)
$$
Var.(\varepsilon) = E[\varepsilon - E(\varepsilon)]^2
= E[\varepsilon - E(\varepsilon \mid \theta)]^2 + E[E(\varepsilon \mid \theta) - E(\varepsilon)]^2
+ 2E[\{\varepsilon - E(\varepsilon \mid \theta)\}\{E(\varepsilon \mid \theta) - E(\varepsilon)\}] .
$$

This indicates that, if the error variable $\varepsilon$ is conditionally unbiased for the interval of $\theta$ of interest,
then (4.9) will be reduced to the form

(4.10) $Var.(\varepsilon) = E[\varepsilon^2]$ .

[IV.2] Reliability Coefficient of a Test in the Sense of Classical Mental
Test Theory When the Maximum Likelihood Estimator of $\theta$ Is
Used

Let $\hat\theta_V$ or $\hat\theta$ denote the maximum likelihood estimator of $\theta$ based upon the response pattern
$V$. If: 1) $\hat\theta$ is conditionally unbiased for the interval of $\theta$ of interest and 2) the test information
function $I(\theta)$ assumes reasonably high values for that interval, then we will be able to approximate the
conditional distribution of $\hat\theta$, given $\theta$, by the normal distribution $N(\theta, [I(\theta)]^{-1/2})$ for the interval
of $\theta$ within which the examinees' ability practically distributes. Thus we have from (4.10)

(4.11) $Var.(\varepsilon) = E[\{I(\theta)\}^{-1}]$ .

When this is the case, from (4.6) we can write

(4.12) $Corr.(\hat\theta_1, \hat\theta_2) = [Var.(\hat\theta_1) - E[\{I(\theta)\}^{-1}]]\,[Var.(\hat\theta_1)]^{-1}$ .

Thus the reliability coefficient in the sense of classical mental test theory can be predicted by a single
administration of the test, given the test information function $I(\theta)$ and the ability distribution of the
examinees.

The appropriateness of the above normal approximation of the conditional distribution of $\hat\theta$, given
$\theta$, can be examined by the Monte Carlo method (cf. Samejima, 1977a). We also notice that a necessary
condition for this approximation is that $\hat\theta$ is conditionally unbiased for the interval of $\theta$ of interest.
Thus we can use the MLE bias function, which was introduced in Section 2, for a test for the support
of the approximation. Note that the MLE bias function together with the ability distribution of the
target population also determines whether the assumption described by (4.5) should be accepted.

If the conditional unbiasedness is not supported, i.e., if $B(\hat\theta_V \mid \theta)$ does not approximately equal
zero for all values of $\theta$ in the interval of interest, however, then we shall be able to adopt one of the
modified test information functions, $T(\theta)$ or $\Xi(\theta)$. Thus we can rewrite (4.12) into the forms

(4.13) $Corr.(\hat\theta_1, \hat\theta_2) = [Var.(\hat\theta_1) - E[\{T(\theta)\}^{-1}]]\,[Var.(\hat\theta_1)]^{-1}$

and

(4.14) $Corr.(\hat\theta_1, \hat\theta_2) = [Var.(\hat\theta_1) - E[\{\Xi(\theta)\}^{-1}]]\,[Var.(\hat\theta_1)]^{-1}$ .

We can decide which of the modified formulae, (4.13) or (4.14), is more appropriate to use in a specified
situation.



[IV.3] Standard Error of Measurement of a Test in the Sense of Classical
Mental Test Theory When the Maximum Likelihood Estimator of
$\theta$ Is Used

In classical mental test theory, the standard error of estimation of ability is represented by a single
number, which is heavily affected by the degree of heterogeneity of the group of examinees tested,
as is the case with the reliability coefficient. In contrast, in latent trait models, the standard error of
estimation is locally defined, i.e., as a function of ability. It is usually represented by the reciprocal of the
square root of the test information function. Since the test information function does not depend upon
any specific group of examinees, but is a sole property of the test itself, this locally defined standard
error is much more appropriate than the standard error of estimation in classical mental test theory.
Also this function indicates that no test is efficient in ability measurement for the entire range of ability,
and each test provides us with large amounts of information only locally, which makes perfect sense
to our knowledge.

The standard error of measurement of a test tailored for a specific ability distribution is given by

(4.15) S.E. $= E[\{I(\theta)\}^{-1/2}]$

when the conditions 1) and 2) described in the preceding section are met, and by

(4.16) S.E.1 $= E[\{T(\theta)\}^{-1/2}]$

or

(4.17) S.E.2 $= E[\{\Xi(\theta)\}^{-1/2}]$

otherwise.

FIGURE 4-1. Density Functions of Six Hypothetical Ability Distributions: n(0.0, 1.0),
n(-0.8, 1.0), n(0.0, 0.5), n(-0.8, 0.5), n(-1.6, 0.5) and n(-2.4, 0.5). (Horizontal axis: THETA.)



[IV.4] Examples

For the purpose of illustration, six ability distributions are hypothesised, and for a single test
predictions are made for their tailored reliability coefficients and tailored standard errors of measurement
in the sense of classical mental test theory, using (4.12), (4.13), (4.14), (4.15), (4.16) and (4.17). These six
hypothetical ability distributions are normal distributions, i.e., N(0.0, 1.0), N(-0.8, 1.0), N(0.0, 0.5),
N(-0.8, 0.5), N(-1.6, 0.5) and N(-2.4, 0.5). Figure 4-1 presents the density functions of these six
distributions. The hypothetical test used here is the same one introduced in the preceding chapter,
which consists of thirty equivalent dichotomous items following the logistic model represented by (3.43)
with the common values of parameters, $a_g = 1.0$ and $b_g = 0.0$, respectively, and with the scaling
factor D set equal to 1.7. The MLE bias function and the square roots of the test information
function $I(\theta)$ and of its two modification formulae $T(\theta)$ and $\Xi(\theta)$ of this test are shown in Figure
3-5 of the preceding chapter.

Tables 4-1 and 4-2 present the resulting predicted reliability coefficients and standard errors of
measurement for the six different ability distributions, respectively. In each table, the mean and the
variance of $\theta$ of each of the six distributions are also given. We can see that these variances are slightly
different from the squares of the second parameters of the normal distributions, i.e., 0.98322 vs.
1.00000 for the populations 1 and 2, and 0.25155 vs. 0.25000 for the populations 3, 4, 5 and 6,
respectively, whereas all of the means are the same as the first parameters of the normal distributions.
These discrepancies in variance come from the fact that we used frequencies for the equally spaced
points of $\theta$ with the step width 0.05, which are given as integers, in order to approximate the normal
distributions, instead of using the density functions themselves.

As you can see in the first table, the predicted reliability coefficient obtained by (4.12) distributes

TABLE 4-1

Three Predicted Reliability Coefficients Tailored for Each of the Six Hypothetical Ability
Distributions, Using the Original Test Information Function and Its Two Modification
Formulae. The Indices, 1, 2 and 3, Represent the Original Test Information Function,
Modification Formula No. 1 and Modification Formula No. 2, Respectively. The
Mean and the Variance of $\theta$ for Each Population Are Also Given.

POPULATION   RELIABILITY 1   RELIABILITY 2   RELIABILITY 3   MEAN OF THETA   VARIANCE OF THETA
    1           0.89641         0.78053         0.76629          0.00000          0.98322
    2           0.82324         0.26479         0.25256         -0.80000          0.98322
    3           0.81738         0.80074         0.79920          0.00000          0.25155
    4           0.73250         0.66611         0.65589         -0.80000          0.25155
    5           0.47715         0.21681         0.20093         -1.60000          0.25155
    6           0.20049         0.01182         0.01109         -2.40000          0.25155


TABLE 4-2

Three Predicted Standard Errors of Measurement Tailored for Each of the Six Hypothetical
Ability Distributions, Using the Original Test Information Function and Its Two
Modification Formulae. The Indices, 1, 2 and 3, Represent the Original Test
Information Function, Modification Formula No. 1 and Modification
Formula No. 2, Respectively. The Mean and the Variance of $\theta$ for
Each Population Are Also Given.

POPULATION   STAND. ERROR 1   STAND. ERROR 2   STAND. ERROR 3   MEAN OF THETA   VARIANCE OF THETA
    1           0.30548          0.37648          0.38514           0.00000          0.98322
    2           0.37887          0.64293          0.66397          -0.80000          0.98322
    3           0.23521          0.24717          0.24811           0.00000          0.25155
    4           0.29172          0.32802          0.33326          -0.80000          0.25155
    5           0.48839          0.73440          0.76583          -1.60000          0.25155
    6           0.91974          2.76394          2.88922          -2.40000          0.25155


TABLE 4-3

Three Theoretical Variances of the Maximum Likelihood Estimates of $\theta$ for Each
of the Six Hypothetical Ability Distributions, Using the Original Test Information
Function and Its Two Modification Formulae. The Indices, 1, 2 and 3, Represent
the Original Test Information Function, Modification Formula No. 1 and
Modification Formula No. 2, Respectively. The Mean and the Variance
of $\theta$ for Each Population Are Also Given.

POPULATION   VARIANCE OF MLE 1   VARIANCE OF MLE 2   VARIANCE OF MLE 3   MEAN OF THETA   VARIANCE OF THETA
    1             1.09684             1.25968             1.28308             0.00000          0.98322
    2             1.19432             3.71324             3.89296            -0.80000          0.98322
    3             0.30775             0.31414             0.31475             0.00000          0.25155
    4             0.34341             0.37763             0.38352            -0.80000          0.25155
    5             0.52718             1.16023             1.25189            -1.60000          0.25155
    6             1.25469            21.28788            22.68190            -2.40000          0.25155



TABLE 4-4

Three Theoretical Error Variances for Each of the Six Hypothetical Ability Distributions,
Using the Original Test Information Function and Its Two Modification Formulae. The
Indices, 1, 2 and 3, Represent the Original Test Information Function, Modification
Formula No. 1 and Modification Formula No. 2, Respectively. The Mean and the
Variance of $\theta$ for Each Population Are Also Given.

POPULATION   VARIANCE OF ERROR 1   VARIANCE OF ERROR 2   VARIANCE OF ERROR 3   MEAN OF THETA   VARIANCE OF THETA
    1              0.11363               0.27646               0.29987              0.00000          0.98322
    2              0.21111               2.73003               2.90974             -0.80000          0.98322
    3              0.05620               0.06260               0.06320              0.00000          0.25155
    4              0.09186               0.12609               0.13197             -0.80000          0.25155
    5              0.27563               0.90868               1.00034             -1.60000          0.25155
    6              1.00314              21.03633              22.43035             -2.40000          0.25155



TABLE 4-5

Reliability Coefficient Computed for Each of the Six Hypothetical Ability Distributions Based
Upon the Maximum Likelihood Estimates of the Examinees for Test-Retest Situations Using
a Test of Thirty Equivalent Items Following the Logistic Model with D = 1.7, $a_g = 1.0$
and $b_g = 0.0$. The Means and Variances of the Two Sessions and the Covariances Are
Also Presented.

POPULATION   RELIABILITY    MEAN 1     MEAN 2    VARIANCE 1   VARIANCE 2   COVARIANCE
    1          0.90788     -0.00311    0.00106     1.19069      1.16769      1.07051
    2          0.88812     -0.81435   -0.80971     1.07982      1.09703      0.96663
    3          0.80724      0.00785   -0.00754     0.33578      0.33443      0.27051
    4          0.72334     -0.85777   -0.84349     0.40504      0.39310      0.28863
    5          0.55304     -1.68722   -1.67511     0.42299      0.40820      0.22980
    6          0.32187     -2.28115   -2.25897     0.21639      0.23189      0.07210


widely, i.e., it varies from 0.200 to 0.896! The coefficient reduces as the main part of the distribution
shifts from a range of $\theta$ where the amount of test information is greater to another range where it is
smaller, and the reduction is more conspicuous when the standard deviation of the normal distribution is
smaller. The predicted reliability coefficient obtained by (4.13) using $T(\theta)$ instead of $I(\theta)$ indicates
a substantial reduction from the one obtained by (4.12) for each of the six ability distributions. The
reduction is especially conspicuous for the populations 2, 5 and 6, whose ability distributes on lower
levels of $\theta$ where the discrepancies between $I(\theta)$ and $T(\theta)$ are large. Among the six populations
the predicted reliability coefficient obtained by means of (4.13) varies from 0.012 to 0.781, showing
an even larger range than that obtained by (4.12). Similar results were obtained for the predicted
reliability coefficient given by (4.14), using $\Xi(\theta)$ instead of $I(\theta)$. The reliability coefficient varies from
0.011 to 0.766, and within each population the reduction in the value of the reliability coefficient from
the one obtained by (4.13) is relatively small, as is expected from the second graph of Figure 3-5.

As for the standard error of measurement, we can see in Table 4-2 that similar results were obtained,
only in reversed order, of course. In classical mental test theory, the standard error of measurement
$\sigma_E$ is given by

(4.18)    $\sigma_E = [\mathrm{Var.}(X)]^{1/2} \, [1 - r_{X_1 X_2}]^{1/2}$ ,

where, as before, $r_{X_1 X_2}$ indicates the reliability coefficient. Comparison of Table 4-1 and Table 4-2
reveals that there are substantial discrepancies between the values of $\sigma_E$ obtained by formula (4.18),
using the tailored reliability coefficients in Table 4-1, which are based upon the maximum likelihood
estimate $\hat{\theta}$, in place of $r_{X_1 X_2}$ in (4.18), and the corresponding standard errors of measurement, which
were obtained by formulae (4.15) through (4.17) and presented in Table 4-2. To give some examples,
for Population No. 1 the results of (4.18) are: 0.319, 0.465 and 0.479, respectively; for Population
No. 3 they are: 0.214, 0.224 and 0.225; and for Population No. 6 they are: 0.448, 0.499 and
0.499. These results are understandable, for the degree of violation from the assumptions behind the
classical mental test theory is different for the separate ability distributions.
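
The arithmetic behind these examples can be sketched in a few lines of Python. The sketch rests on two assumptions that are not stated in this part of the report: that each tailored reliability coefficient equals Var($\theta$) divided by the sum of Var($\theta$) and the corresponding error variance, and that the variance entering (4.18) is Var($\theta$). The error variances and variances of $\theta$ are the tabulated values for Populations No. 1 and No. 6; the function names are illustrative only. Under these assumptions the results agree, to three decimals, with the values quoted above for the two populations.

```python
import math

# Error variances (original test information function and its two modification
# formulae) and the variance of theta, taken from the tabulated rows for
# Populations No. 1 and No. 6.
POPULATIONS = {
    1: {"error_variances": (0.11363, 0.27646, 0.29987), "var_theta": 0.98322},
    6: {"error_variances": (1.00314, 21.03633, 22.43035), "var_theta": 0.25155},
}


def tailored_reliability(var_theta, error_variance):
    """ASSUMPTION: reliability = Var(theta) / [Var(theta) + error variance]."""
    return var_theta / (var_theta + error_variance)


def classical_sem(variance, reliability):
    """Formula (4.18): sigma_E = sqrt(variance) * sqrt(1 - reliability)."""
    return math.sqrt(variance) * math.sqrt(1.0 - reliability)


def sem_table(populations=POPULATIONS):
    """Return, per population, the three values of (4.18) (unrounded)."""
    table = {}
    for population, row in populations.items():
        v = row["var_theta"]  # ASSUMPTION: Var(theta) is the variance used in (4.18)
        table[population] = [
            classical_sem(v, tailored_reliability(v, e))
            for e in row["error_variances"]
        ]
    return table


if __name__ == "__main__":
    for population, values in sem_table().items():
        print(population, [round(x, 3) for x in values])
```

Written this way, (4.18) reduces to the square root of Var($\theta$) times the error variance divided by their sum, which can never exceed the square root of the error variance itself.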

The three theoretical variances of the maximum likelihood estimate of $\theta$ and the three theoretical
error variances are presented in Tables 4-3 and 4-4, respectively, for each of the six hypothetical
populations. The latter were obtained by (4.11) and by replacing $I(\theta)$ in (4.11) by $\Upsilon(\theta)$ and $\Xi(\theta)$,
respectively, and the former are the sum of these separate error variances and the variance of $\theta$.

In order to satisfy our curiosity, a simulation study has been made in such a way that, following
each of the six ability distributions, a group of examinees is hypothesised, and, using the Monte Carlo
method, a response pattern of each hypothetical subject is produced for each of the test and retest
situations. Since our test consists of thirty equivalent dichotomous test items, the simple test score is a
sufficient statistic for the response pattern, and the maximum likelihood estimate of $\theta$ can be obtained
upon this sufficient statistic. The numbers of hypothetical subjects are 1,998 for Populations No. 1
and No. 2, and 2,004 for Populations No. 3, No. 4, No. 5 and No. 6. The correlation coefficient
between the two sets of $\hat{\theta}$ was computed, and the results are presented in Table 4-5. Comparison of
each of these results with the corresponding three tailored reliability coefficients in Table 4-1 gives the
impression that, overall, these correlation coefficients are higher than the predicted tailored reliability
coefficients. This enhancement comes from the fact that in each distribution there are a certain number
of subjects who obtained negative or positive infinity as $\hat{\theta}$, and we have replaced these negative and
positive infinities by more or less arbitrary values, -2.65 and 2.65, respectively, in computing the
correlation coefficients. Since in Population No. 3 none of the 2,004 hypothetical subjects got negative
or positive infinity for their maximum likelihood estimates of $\theta$ in the first session, and only three got
negative infinity and none got positive infinity in the second session, this result, 0.807, will be the
most trustworthy value. We can see that this value, 0.807, is less than 0.817 obtained by using the
original test information function $I(\theta)$, and a little greater than 0.801 obtained upon the Modification
Formula No. 1, $\Upsilon(\theta)$. The next most trustworthy value may be 0.723 of Population No. 4, for which
none of the 2,004 subjects obtained positive infinity as their $\hat{\theta}$'s in each of the two sessions, and 56
and 45 got negative infinity in the first and second sessions, respectively. This value of the correlation
coefficient, 0.723, is a little less than the predicted reliability coefficient 0.733 obtained upon $I(\theta)$,
but somewhat greater than 0.666, which is based upon $\Upsilon(\theta)$, the Modification Formula No. 1; the
artificial enhancement is already visible. The numbers of subjects who obtained negative and positive
infinities in the first session and in the second session are: 56, 47, 43 and 49 for Population No.
1; 197, 4, 195 and 6 for Population No. 2; 437, 0, 399 and 0 for Population No. 5; and
1,143, 0, 1,118 and 0 for Population No. 6. We must say that, for these four distributions, the
values of the correlation coefficients in Table 4-5 should not be taken too seriously, for these values are
enhanced because of the involvement of too many substitute values for negative and positive infinities.

[IV.5] Discussion and Conclusions 

Test information function $I(\theta)$ and its two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$, have been
used to predict the reliability coefficient and the standard error of measurement which are tailored for
each specific ability distribution. Examples of the prediction have been given and a simulation study 
has been conducted and shown for comparison. These examples using equivalent test items have been 
rather intentionally chosen to make the differences among the separate ability distributions, and those 
among the three predicted indices for each ability distribution, clearly visible. 

Since we have more useful and informative measures like the test information function and its two 
modified formulae, the reliability coefficient of a test is no longer necessary in modern mental test 
theory. And yet it is interesting to know how to predict the coefficient using these functions, which are 
tailored for each separate population of examinees. In this process, it will become more obvious that 
the traditional concept of test reliability is misleading, for without changing the test the coefficient can 
be drastically different if we change the population of examinees.

V Validity Measures in the Context of Latent Trait Models

From the scientific point of view, we need to confirm, for a given measure, what it is [...] their
contents, and even if we are equipped with highly sophisticated mathematics.

By virtue of the population-free nature of latent trait theory, we should be able to find some indices
of validity which are not affected by the group of examinees. The resulting indices will not be
incidental, as they are in classical mental test theory, but will truly be attributes of
the item and the test themselves. Thus an attempt has been made in the present research to obtain
such population-free measures of item validity and of test validity, which are basically locally defined.

[V.1] Performance Function: Regression of the External Criterion Variable on the Latent Variable

We shall consider the situation in which there is an external criterion variable, which can be
measured directly or indirectly. This is the situation which is usually assumed when we deal with
criterion-oriented validity or predictive validity in classical mental test theory.

Let $\gamma$ denote the criterion variable, representing the performance in a specific job, etc. We shall
consider the conditional density of the criterion performance, given ability, and denote it by $\xi(\gamma \mid \theta)$.
The performance function, $\zeta(\theta)$, can be defined as the regression of $\gamma$ on $\theta$, or by taking, say, the
75, 90 or 95 percentile point of each conditional distribution of $\gamma$, given $\theta$. Let $p_0$ denote the
probability which is large enough to satisfy us as a confidence level. Thus we can write

(5.1)    $p_0 = \int_{\zeta(\theta)}^{\bar{\gamma}} \xi(\gamma \mid \theta) \, d\gamma$ ,

where $\bar{\gamma}$ denotes the least upper bound of the criterion variable $\gamma$.

Figure 5-1 illustrates the relationships among $\theta$, $\gamma$, $p_0$, $\xi(\gamma \mid \theta)$ and $\zeta(\theta)$. It may be reasonable
to assume that the functional relationship between $\theta$ and $\zeta(\theta)$ is relatively simple, not as is illustrated
by [...], i.e., we do not expect $\zeta(\theta)$ to go up and down frequently within a
relatively short range of $\theta$. We shall assume that $\zeta(\theta)$ is twice differentiable with respect to $\theta$.

In dealing with an additional dimension or dimensions in latent space, i.e., the criterion variable or
variables, one of the most difficult issues is to keep the population-free nature, which is characteristic of
the latent trait models, the main feature that distinguishes the theory from classical mental test theory,
among others. If we consider the projection of the operating characteristic of a discrete item response
on the criterion dimension, for example, then the resulting operating characteristic as a function of $\gamma$
has to be incidental, for it has to be affected by the population distribution of $\theta$.

We need to start from the conditional distribution of $\gamma$, given $\theta$, therefore, which can be conceived
of as being intrinsic in the relationship between the two variables, and independent of the population
distribution of $\theta$. We assume that $\zeta(\theta)$ takes on the same value only at a finite or an enumerable
number of points of $\theta$. Let $P^*_{k_g}(\zeta)$ be the conditional probability assigned to the discrete response
$k_g$, given $\zeta$. We can write

(5.2)    $P^*_{k_g}(\zeta) = \sum_{\zeta(\theta) = \zeta} P_{k_g}(\theta)$ .

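[V.2] When $\zeta(\theta)$ Is Strictly Increasing in $\theta$: Simplest Case

The simplest case is that $\zeta(\theta)$ is strictly increasing in $\theta$. In this case, $\zeta(\theta)$ has a one-to-one
correspondence with $\theta$, and (5.2) becomes simplified into the form

(5.3)    $P^*_{k_g}(\zeta) = P^*_{k_g}(\zeta(\theta)) = P_{k_g}(\theta)$ .

If, in addition, $\{\partial\theta / \partial\zeta\}$ is finite throughout the entire range of $\theta$, then we obtain

(5.4)    [...]

Let $I^*_{k_g}(\zeta)$ be the item response information function defined as a function of $\zeta$. We can write

(5.5)    $I^*_{k_g}(\zeta) = -\frac{\partial^2}{\partial\zeta^2} \log P^*_{k_g}(\zeta) = -\frac{\partial}{\partial\zeta} \left\{ \frac{\partial}{\partial\theta} \log P_{k_g}(\theta) \, \frac{\partial\theta}{\partial\zeta} \right\}$

          $= I_{k_g}(\theta) \left( \frac{\partial\theta}{\partial\zeta} \right)^2 - \left[ \frac{\partial}{\partial\theta} P_{k_g}(\theta) \right] [P_{k_g}(\theta)]^{-1} \, \frac{\partial^2\theta}{\partial\zeta^2}$ .

Let $I^*_g(\zeta)$ and $I^*(\zeta)$ be the amounts of information given by a single item $g$ and by the total
test, respectively, for a fixed value of $\zeta$. Then we have from (2.3), (2.8) and (5.5)

(5.6)    $I^*_g(\zeta) = E[I^*_{k_g}(\zeta) \mid \zeta] = \sum_{k_g} I^*_{k_g}(\zeta) \, P^*_{k_g}(\zeta) = I_g(\theta) \left( \frac{\partial\theta}{\partial\zeta} \right)^2$

and

(5.7)    $I^*(\zeta) = \sum_g I^*_g(\zeta) = I(\theta) \left( \frac{\partial\theta}{\partial\zeta} \right)^2$ .

If we take the square roots of these two information functions defined for $\zeta$, then we obtain

(5.8)    $[I^*_g(\zeta)]^{1/2} = [I_g(\theta)]^{1/2} \, \frac{\partial\theta}{\partial\zeta}$

and

(5.9)    $[I^*(\zeta)]^{1/2} = [I(\theta)]^{1/2} \, \frac{\partial\theta}{\partial\zeta}$ .

Relationships among $\theta$, $\gamma$, $p_0$, $\xi(\gamma \mid \theta)$ and $\zeta(\theta)$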

Since a certain constant nature exists for the square root of the item information function while the
same is not true with the original item information function (cf. Samejima, 1979, 1982), $[I^*_g(\zeta)]^{1/2}$
given by (5.8) instead of the original function given by (5.6) may be more useful in some occasions.
This will be discussed later in this section, when the validity in selection plus classification is discussed.

Suppose that we have a critical value, $\gamma_0$, of the criterion variable, which is needed for succeeding
in a specified job, and that we try to accept applicants whose values of the criterion variable are $\gamma_0$
or greater. If our primary purpose of testing is to make an accurate selection of applicants, then
(5.8) and (5.9) for $\zeta = \gamma_0$, or their squared values shown by (5.6) and (5.7), indicate item and test
validities, respectively. If for some item formula (5.8) or (5.6) assumes a high value at $\zeta = \gamma_0$, then
the standard error of estimation of $\zeta$ around $\zeta = \gamma_0$ becomes small and chances are slim that we
make misclassifications of the applicants by accepting unqualified persons and rejecting qualified ones,
and the reversed relationship holds when (5.8) or (5.6) assumes a low value at $\zeta = \gamma_0$. The same logic
applies to the total test by using formula (5.9) or (5.7) instead of (5.8) or (5.6).

It should be noted in (5.8) or in (5.9) that $[I^*_g(\gamma_0)]^{1/2}$ or $[I^*(\gamma_0)]^{1/2}$ consists of two factors,
i.e., 1) the square root of the item information function $I_g(\theta)$ or that of the test information function
$I(\theta)$, and 2) the partial derivative of ability $\theta$ with respect to $\zeta$ at $\zeta = \gamma_0$. These two factors in
each formula are independent of each other, i.e., one belongs to the item or to the test and the other
to the statistical relationship between $\theta$ and $\gamma$. We also notice that these two factors are in a
supplementary relationship. Thus while it is important to have a large amount of item information, or
of test information, it is even more so to have large values of the derivative, $\{\partial\theta / \partial\zeta\}$, in the vicinity
of $\zeta = \gamma_0$, for this will increase the amount of item information defined with respect to $\zeta$ uniformly
in that vicinity, and also that of test information, as is obvious from the right hand sides of (5.8) and
(5.9). In other words, it is desirable for the purpose of selection for $\zeta$ to increase slowly in $\theta$ in the
vicinity of $\zeta = \gamma_0$.
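
The two-factor structure of (5.8) can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration only: the item follows the logistic model with $D = 1.7$, so that its information is $D^2 a^2 P(\theta)[1 - P(\theta)]$, and the performance function is the hypothetical linear function $\zeta(\theta) = \mathrm{intercept} + \mathrm{slope} \cdot \theta$ with a positive slope, so that $\partial\theta / \partial\zeta = 1/\mathrm{slope}$.

```python
import math

D = 1.7  # scaling constant of the logistic model


def item_information(theta, a, b):
    """Information of a logistic item: D^2 a^2 P(theta) [1 - P(theta)]."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)


def item_validity_at(gamma0, a, b, slope, intercept):
    """Formula (5.8) evaluated at zeta = gamma0.

    ASSUMPTION: zeta(theta) = intercept + slope * theta with slope > 0
    (a hypothetical performance function chosen for illustration).
    """
    theta0 = (gamma0 - intercept) / slope   # ability level at which zeta = gamma0
    dtheta_dzeta = 1.0 / slope              # second factor of (5.8)
    return math.sqrt(item_information(theta0, a, b)) * dtheta_dzeta


if __name__ == "__main__":
    # Same item, same critical ability level (theta0 = 0), two criterion variables.
    slow = item_validity_at(gamma0=1.0, a=1.0, b=0.0, slope=0.5, intercept=1.0)
    fast = item_validity_at(gamma0=1.0, a=1.0, b=0.0, slope=2.0, intercept=1.0)
    print(slow, fast)
```

In this example the item and the critical ability level are identical in both calls, so the first factor is the same and the difference comes entirely from the derivative: the criterion that increases slowly in $\theta$ yields the larger value of (5.8).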

Since, in general, the same ability $\theta$ has predictabilities for more than one kind of job performance,
or of potential of achievement, the performance function varies for different criterion variables. Note
that neither $[I_g(\theta)]^{1/2}$ nor $[I(\theta)]^{1/2}$ is changed even when the criterion variable is switched. Thus,
for a fixed item or test whose amount of information is reasonably large around $\zeta = \gamma_0$, the derivative
$\{\partial\theta / \partial\zeta\}$ in the vicinity of $\zeta = \gamma_0$ determines the appropriateness of the use of the item or of the test for
the purpose of selection with respect to a specific job, etc. If this derivative assumes a high value, then
an item or a test which provides us with a medium amount of information may be acceptable for our
purpose of selection, while we will need an item or a test whose amount of information is substantially
larger if the derivative is low. Also for the same criterion variable $\gamma$ the derivative $\{\partial\theta / \partial\zeta\}$ varies for
different values of $\gamma_0$, so the appropriateness of an item or of a test depends upon our choice of $\gamma_0$,
too. The above logic also applies for the formulae (5.6) and (5.7), i.e., for the case in which we choose
the information functions, instead of their square roots, changing $\{\partial\theta / \partial\zeta\}$ to its squared value.

It is obvious from (5.6) and (5.8) that we can choose either $I_g(\theta(\gamma_0))$ or $[I_g(\theta(\gamma_0))]^{1/2}$ for use in
item selection, for their rank orders across different items are identical, and they equal the rank orders
of $I^*_g(\gamma_0)$ as well as those of $[I^*_g(\gamma_0)]^{1/2}$.

If we take another standpoint that our purpose of testing is not only to make a right selection of
applicants but also to predict the degree of success in the job for each selected individual, then we will
need to integrate $[I^*_g(\zeta)]^{1/2}$ and $[I^*(\zeta)]^{1/2}$, respectively, since we must estimate $\zeta$ accurately not
only around $\zeta = \gamma_0$ but also for $\zeta > \gamma_0$. If we choose $[I^*_g(\zeta)]^{1/2}$ and $[I^*(\zeta)]^{1/2}$ in preference to
their squared values, we will obtain from (5.8) and (5.9)

(5.10)    $\int_{\Omega_\zeta} [I^*_g(\zeta)]^{1/2} \, d\zeta = \int_{\Omega_\theta} [I_g(\theta)]^{1/2} \, d\theta$

and

(5.11)    $\int_{\Omega_\zeta} [I^*(\zeta)]^{1/2} \, d\zeta = \int_{\Omega_\theta} [I(\theta)]^{1/2} \, d\theta$ ,

where $\Omega_\zeta$ and $\Omega_\theta$ indicate the domains of $\zeta$ and $\theta$ for which $\zeta(\theta) > \gamma_0$, respectively. In this
situation we need to select items which assume high values of (5.10) instead of (5.8), or a test which
provides us with a high value of (5.11) in place of (5.9). Note that formulae (5.10) and (5.11) imply that
we can obtain these two validity measures directly from the original item and test information functions,
respectively, i.e., without actually transforming $\theta$ to $\zeta$, as long as we can identify the domain $\Omega_\theta$.
This is true for any criterion variable $\gamma$.

Some examples illustrating the values of (5.10) are given in Figure 5-3 for hypothetical items. In the
simplest case observed in this section and illustrated in Figures 5-1 and 5-3, these two domains, $\Omega_\theta$
and $\Omega_\zeta$, are provided by the two intervals, $(\theta_{\gamma_0}, \infty)$ and $(\gamma_0, \bar{\gamma})$, where

(5.12)    $\theta_{\gamma_0} = \theta(\gamma_0)$

and $\bar{\gamma}$ denotes the least upper bound of $\gamma$.

Some Examples of the Relationship between [...] and the Item Validity Measure
Given by (5.10).

It should be noted that the above pair of validity measures depends upon our choice of the critical
value $\gamma_0$. If this value is low, i.e., a specified job does not require high levels of competence with
respect to the criterion variable $\gamma$, then these validity indices assume high values, and vice versa.

FIGURE 5-4

Relationship between $\gamma_0$ and Item Validity Indicated by (5.10) for Three Hypothetical
Dichotomous Items Whose Operating Characteristics for the Correct Answer Are
Strictly Increasing with Zero and Unity as Their Asymptotes.

It has been pointed out (Samejima, 1979, 1982) that there is a certain constancy in the amount of
information provided by a single test item. To give an example, if an item is dichotomously scored and
has a strictly increasing operating characteristic for success with zero and unity as its two asymptotes,
then the area under the curve for $[I_g(\theta)]^{1/2}$ equals $\pi$, regardless of the mathematical form of the
operating characteristic and its parameter values. We can see, therefore, that if our items belong to this
type then the functional relationship between $\gamma_0$ and the item validity measure given by (5.10) will
be monotone decreasing, with $\pi$ and zero as its two asymptotes, for each and every item. Figure 5-4
illustrates this relationship for three hypothetical items of this type. As we can see in this figure, the
appropriateness of the items changes with $\gamma_0$ in an absolute sense, and also relatively to other items
with $\gamma_0$, and the rank orders of desirability among the items depend upon our choice of $\gamma_0$.
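
The constancy can be checked numerically. The following Python sketch assumes a logistic item with $D = 1.7$, for which $[I_g(\theta)]^{1/2} = D a \{P(\theta)[1 - P(\theta)]\}^{1/2}$, and integrates this function from a lower limit $\theta_0$, standing for $\theta(\gamma_0)$, by the trapezoidal rule. The closed form $\pi - 2 \arcsin \{P(\theta_0)\}^{1/2}$ used for comparison follows from the substitution $dP = D a P (1 - P) \, d\theta$. The function names, the integration limits and the step count are illustrative choices.

```python
import math

D = 1.7  # scaling constant of the logistic model


def sqrt_item_information(theta, a, b):
    """Square root of the information of a logistic item: D a sqrt(P Q)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * a * math.sqrt(p * (1.0 - p))


def validity_5_10(theta0, a, b, upper=40.0, steps=20000):
    """Right-hand side of (5.10) over (theta0, upper) by the trapezoidal rule."""
    h = (upper - theta0) / steps
    total = 0.5 * (sqrt_item_information(theta0, a, b)
                   + sqrt_item_information(upper, a, b))
    for i in range(1, steps):
        total += sqrt_item_information(theta0 + i * h, a, b)
    return total * h


def validity_closed_form(theta0, a, b):
    """pi - 2 arcsin(sqrt(P(theta0))), obtained by integrating over P instead of theta."""
    p0 = 1.0 / (1.0 + math.exp(-D * a * (theta0 - b)))
    return math.pi - 2.0 * math.asin(math.sqrt(p0))


if __name__ == "__main__":
    print(validity_5_10(-40.0, 1.0, 0.0))   # whole area, close to pi
    print(validity_5_10(-40.0, 2.0, 1.0))   # different parameters, same area
    for theta0 in (-1.0, 0.0, 1.0):
        print(theta0, validity_5_10(theta0, 1.0, 0.0), validity_closed_form(theta0, 1.0, 0.0))
```

The total area does not depend on the item parameters, and the partial area decreases monotonically from $\pi$ toward zero as the lower limit increases, which is the behaviour described for the item validity measure as a function of $\gamma_0$.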

We can see from (5.10) that this validity measure necessarily assumes a high value if an item is
difficult, and the same applies to (5.11) for the total test. This implies that these validity measures
alone cannot indicate the desirability of an item and of a test precisely for a specific population of
examinees. In selecting items or a test, therefore, it is desirable to take the ability distribution of the
examinees into account, if the information concerning the ability distribution of a target population is
more or less available. In so doing we shall be able to avoid choosing items which are too difficult for
the target population of examinees. Let $f(\theta)$ denote the density function of the ability distribution
for a specific population of examinees, and $f^*(\zeta)$ be that of $\zeta$ for the same population. Then we can
write

(5.13)    $f^*(\zeta) = f(\theta) \, \frac{\partial\theta}{\partial\zeta}$ .

Adopting this as the weight function, from (5.8) and (5.9) we obtain as the validity indices tailored for
a specific population of examinees

(5.14)    $\int_{\Omega_\zeta} [I^*_g(\zeta)]^{1/2} \, f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I_g(\theta)]^{1/2} \, f(\theta) \, \frac{\partial\theta}{\partial\zeta} \, d\theta$

and

(5.15)    $\int_{\Omega_\zeta} [I^*(\zeta)]^{1/2} \, f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I(\theta)]^{1/2} \, f(\theta) \, \frac{\partial\theta}{\partial\zeta} \, d\theta$ .

Thus by using (5.14) and (5.15) instead of (5.10) and (5.11) we shall be able to make appropriate item
selection and test selection, provided that the information concerning the ability distribution of the
target population is more or less available. Note that, unlike (5.10) and (5.11), formulae (5.14)
and (5.15) imply that these validity measures are also heavily dependent upon the functional formula of $\zeta(\theta)$.
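
The effect of the weight function can be seen in a small numerical sketch in Python. It compares the right-hand sides of (5.10) and (5.14) for an item of medium difficulty and a very difficult item. All specifics are assumptions for illustration: logistic items with $D = 1.7$, a standard normal ability density, and a hypothetical linear performance function, so that $\partial\theta / \partial\zeta$ is the constant 1/slope.

```python
import math

D = 1.7  # scaling constant of the logistic model


def sqrt_info(theta, a, b):
    """Square root of the information of a logistic item: D a sqrt(P Q)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * a * math.sqrt(p * (1.0 - p))


def normal_pdf(theta, mean, sd):
    """Normal density, used here as the hypothetical ability distribution f(theta)."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def trapezoid(func, lower, upper, steps=8000):
    """Composite trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * h)
    return total * h


def validity_5_10(theta0, a, b, upper=30.0):
    """Right-hand side of (5.10) over (theta0, upper)."""
    return trapezoid(lambda t: sqrt_info(t, a, b), theta0, upper)


def validity_5_14(theta0, a, b, slope, mean, sd, upper=30.0):
    """Right-hand side of (5.14).

    ASSUMPTIONS: zeta(theta) is linear with the given positive slope, so that
    dtheta/dzeta = 1/slope, and the ability distribution is normal.
    """
    return trapezoid(
        lambda t: sqrt_info(t, a, b) * normal_pdf(t, mean, sd) / slope,
        theta0, upper)


if __name__ == "__main__":
    for b in (0.0, 3.0):  # medium item versus very difficult item
        print(b, validity_5_10(0.0, 1.0, b), validity_5_14(0.0, 1.0, b, 1.0, 0.0, 1.0))
```

With the lower limit at $\theta = 0$, the difficult item has the larger value of (5.10), but the ordering is reversed under (5.14) because few examinees of this population are located where that item is informative.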

If we choose to use the area under the curve of the information function instead of that of its square
root, then in place of (5.10) and (5.11) we obtain

(5.16)    $\int_{\Omega_\zeta} I^*_g(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) \, \frac{\partial\theta}{\partial\zeta} \, d\theta$

and

(5.17)    $\int_{\Omega_\zeta} I^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) \, \frac{\partial\theta}{\partial\zeta} \, d\theta$ ,

respectively. Note that the right hand sides of (5.16) and (5.17) are no longer independent of the
functional formula of $\zeta(\theta)$. Also when the information concerning the ability distribution of a target
population of examinees is more or less available, the tailored item and test validity indices become

(5.18)    $\int_{\Omega_\zeta} I^*_g(\zeta) \, f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) \, f(\theta) \left( \frac{\partial\theta}{\partial\zeta} \right)^2 d\theta$

and

(5.19)    $\int_{\Omega_\zeta} I^*(\zeta) \, f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) \, f(\theta) \left( \frac{\partial\theta}{\partial\zeta} \right)^2 d\theta$ ,

respectively, if we choose to use the information functions instead of their square roots.

Unlike the validity measures for selection purposes, in the present situation the rank
orders of validity across different items, or different tests, depend upon the choice of the validity index.
Thus a question is: which of the formulae, (5.10) or (5.16), and (5.11) or (5.17), are better as the
item and the test validity indices for selection plus classification purposes? A similar question is
addressed to (5.14) and (5.18), and to (5.15) and (5.19). These are [...] questions to
answer. The use of the square root of the item information function has an advantage of a
certain constancy, as was [...] in this subsection. The use of the item information
function has an advantage of additivity, i.e., by virtue of (2.8) the sum total of (5.16) over the items equals
(5.17), and a similar relationship holds between (5.18) and (5.19). The answers to these questions are
[...].

If our purpose of testing is strictly the classification of individuals, as in assigning those people
to different training programs, in guidance, etc., (5.10) and (5.11), or (5.16) and (5.17), also serve as the
validity measures of an item and of a test, respectively. In this case, we must set $\gamma_0 = \underline{\gamma}$ in defining the
domains, $\Omega_\zeta$ and $\Omega_\theta$, where $\underline{\gamma}$ is the greatest lower bound of $\gamma$. Thus the two domains, $\Omega_\zeta$ and $\Omega_\theta$,
in these formulae become those of $\zeta$ and $\theta$ for which $\underline{\gamma} \le \zeta(\theta) \le \bar{\gamma}$. It is obvious that these formulae
provide us with the item and the test validity measures, respectively, for the same reason explained
earlier. The same logic applies for the tailored validity measures provided by (5.14) and (5.15), and
by (5.18) and (5.19), when the information concerning the ability distribution of a target population is
more or less available.

FIGURE 5-5

Example of the Performance Function $\zeta(\theta)$ Which Is Piecewise Monotone in $\theta$

[V.3] Test Validity Measures Obtained from More Accurate Minimum Variance Bounds

When $\{\partial\zeta / \partial\theta\} = 0$ at some value of $\theta$, as is illustrated by a dashed line in Figure 5-2, $\{\partial\theta / \partial\zeta\}$
becomes positive infinity, and so does the item validity measure given by (5.8). This fact provides us
with some doubt, for, while we can see that at such a point of $\zeta$ item validity is high, we must wonder
if positive infinity is an adequate measure. It is also obvious from (2.8) that the same will happen to the
total test if it includes at least one such item. Our question is: should we search for more meaningful
functions than the item and test information functions? This topic will be discussed in this section.

Necessity of the search for a more accurate measure than the test information function becomes
more urgent when the performance function, $\zeta(\theta)$, is not strictly increasing in $\theta$, but is, say, only
piecewise monotone in $\theta$ with finite $\{\partial\theta / \partial\zeta\}$ and differentiable with respect to $\theta$, as is illustrated
in Figure 5-5. The illustrated performance function is still simple enough, but indicates the trend that
after a certain point of ability the performance level in a specified job decreases. This can happen when
the job does not provide enough challenge for persons of very high ability levels.

Since $I^*(\zeta)$ serves as the reciprocal of the conditional variance of the maximum likelihood estimate of
$\zeta$ only asymptotically and there exist more accurate minimum variance bounds for any (asymptotically)
unbiased estimator (cf. Kendall and Stuart, 1961), we can search for more accurate test validity
measures than the one given by (5.9) by using the reciprocal of the square roots of such minimum
variance bounds.

Let $J_{rs}(\theta)$ be defined as

(5.20)    $J_{rs}(\theta) = E\left[ \frac{L_V^{(r)}(\theta)}{L_V(\theta)} \cdot \frac{L_V^{(s)}(\theta)}{L_V(\theta)} \,\Big|\, \theta \right]$ ,    $r, s = 1, 2, \ldots, k$ ,

where

(5.21)    $L_V^{(s)}(\theta) = \frac{\partial^s}{\partial\theta^s} L_V(\theta) = \frac{\partial^s}{\partial\theta^s} P_V(\theta)$ .

Let $J(\theta)$ denote the $(k \times k)$ matrix of the element $J_{rs}(\theta)$, and $J^{-1}_{rs}(\theta)$ be the corresponding element
of its inverse matrix, $J^{-1}(\theta)$. Note that when $k = 1$ we can rewrite (5.20) into the form

(5.22)    $J_{kk}(\theta) = J_{11}(\theta) = E\left[ \left\{ \frac{\partial}{\partial\theta} \log L_V(\theta) \right\}^2 \Big|\, \theta \right]$

          $= -E\left[ \frac{\partial^2}{\partial\theta^2} \log P_V(\theta) \,\Big|\, \theta \right]$ ,

and from this, (2.7) and (2.8) we can see that $J(\theta)$ is a $(1 \times 1)$ matrix whose element is the test
information function, $I(\theta)$, itself. A set of improved minimum variance bounds is given by

(5.23)    $\sum_{r=1}^{k} \sum_{s=1}^{k} \zeta^{(r)}(\theta) \, J^{-1}_{rs}(\theta) \, \zeta^{(s)}(\theta)$

(cf. Kendall and Stuart, 1961), where $\zeta^{(s)}(\theta)$ denotes the s-th partial derivative of $\zeta(\theta)$ with respect
to $\theta$. We obtain, therefore, for a set of new test validity measures

(5.24)    $\left[ \sum_{r=1}^{k} \sum_{s=1}^{k} \zeta_0^{(r)} \, J^{-1}_{rs}(\theta(\gamma_0)) \, \zeta_0^{(s)} \right]^{-1/2}$ ,

where $\zeta_0^{(s)}$ indicates the s-th partial derivative of $\zeta$ with respect to $\theta$ at $\zeta = \gamma_0$.

The use of this new test validity measure will ameliorate the problems caused by $\{\partial\zeta / \partial\theta\} = 0$, if
we choose an appropriate $k$. The resulting algorithm will become much more complicated, however,
and we must expect a substantially larger amount of CPU time for computing these measures when $k$
is greater than unity. Note that (5.24) equals (5.9) when $k = 1$.



[V.4] Multidimensional Latent Space 

When our latent space is multidimensional, a generalisation of the idea given in Section 5.3 for the
unidimensional latent space can be made straightforwardly. We can write

(5.25)    $\theta = \{\theta_u\}'$ ,    $u = 1, 2, \ldots, \eta$ ,

and the performance function $\zeta(\theta)$ becomes a function of $\eta$ independent variables. A minimum
variance bound is given by

[...]

where $I^{uv}(\theta)$ is the (u,v)-th element of the inverse matrix of the $(\eta \times \eta)$ symmetric matrix, whose
element is given by

[...]

with $L$ abbreviating $L_V(\theta)$, or $P_V(\theta)$. The reciprocal of the square root of (5.27) will provide us
with the counterpart of (5.9) for the multidimensional latent space. For $\eta = 2$, the area $\Omega_\theta$ may look
like one of the contours illustrated in Figure 5-6, depending upon our choice of $\gamma_0$, taking the axis for
$\gamma$ vertical to the plane defined by $\theta_1$ and $\theta_2$.

FIGURE 5-6

Area $\Omega_\theta$ for Different $\gamma_0$'s in Two-Dimensional Latent Space
for a Hypothesised Test.

In a more complex situation where both ability and the criterion variables are multidimensional, we
must consider the projection of the item information function on the criterion subspace from the ability
subspace, in order to have the item validity function for each item, and then the test validity function.
It is anticipated that we must deal with a higher mathematical complexity in such a case. The situation
will substantially be simplified, however, if the total set of items consists of several subsets of items,
each of which measures, exclusively, a single ability dimension and a single criterion dimension.

[V.5] Discussion and Conclusions 

Some considerations have been made concerning the validity of a test and that of a single item.
Effort has been focused upon searching for measures which are population-free, and which will provide
us with abundant information just as the information functions do in comparison with the test
reliability coefficient in classical mental test theory. In so doing, validity indices for different purposes
of testing and also those which are tailored for a specific population of examinees have been considered.

The above considerations for the item and test validities may be just part of many possible
approaches. We may still have a long way to go before we discover the most useful measures of the item
and test validities. The present research may stimulate other researchers so that they will pursue this
topic further, taking different approaches.

We notice that the test validity measures proposed in this research can be modified by using one
of the two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$, of the test information function (cf. Chapter 3), in
place of the original $I(\theta)$. This will be investigated in the future, when the characteristics of these two
modification formulae have further been investigated and clarified.

VI Further Investigation of the Nonparametric Approach to the Estimation of the Operating
Characteristics of Discrete Item Responses

In the present research a method has been proposed which increases accuracies of estimation of
the operating characteristics of discrete item responses, while pertaining to the two features described
in Section 2.3, and the new procedure has been tested upon dichotomous items. It has proved to be
effective, especially when the true operating characteristic is represented by a steep curve, and also at
the lower and upper ends of the ability distribution where the estimation tends to be inaccurate because
of smaller numbers of subjects involved in the base data. Tentatively, it is called the Differential Weight
Procedure, and it belongs to the Conditional P.D.F. Approach (cf. Chapter 2). This procedure costs
more CPU time than the Simple Sum Procedure, which has been used frequently (cf. Samejima, 1981,
1988), but the advantage of handling more than one item, say, fifty, together in the Conditional P.D.F.
Approach is still there.

[VI.1] Simple Sum Procedure of the Conditional P.D.F. Approach Combined with the Normal
Approach Method
It is obvious from the discussion given in Chapter 2 that the Conditional P.D.F. Approach combined
with the Normal Approach Method is the simplest and one of the most economical procedures in CPU
time. Out of the three procedures of the Conditional P.D.F. Approach the Simple Sum Procedure is the
simplest one (cf. Samejima, 1981). For this reason, the combination of the Simple Sum Procedure of
the Conditional P.D.F. Approach and the Normal Approach Method has most frequently been applied
for simulated and empirical data. Fortunately, in spite of the simplicity of the procedure, the results
with simulated data in the adaptive testing situation and with simulated and empirical data in the
paper-and-pencil testing situation indicate that we can estimate the operating characteristics fairly
accurately by using this combination (cf. Samejima, 1981, 1984). This seems to prove the robustness of
the Conditional P.D.F. Approach. For one thing, there is a good reason why the Normal Approach Method
works well, for the conditional distribution of $\tau$, given $\hat{\tau}$, is indeed normal if the (unconditional)
distribution of $\tau$ is normal, and it is a truncated normal distribution if the (unconditional) distribution
of $\tau$ is rectangular, and the truncation is negligible for most of the conditional distributions.

In the Simple Sum Procedure of the Conditional P.D.F. Approach, the operating characteristic,
$P_{k_g}(\theta)$, of the discrete item response $k_g$ of an unknown item $g$ is estimated through the formula

(6.1)    $\hat{P}_{k_g}(\theta) = \hat{P}^*_{k_g}[\tau(\theta)] = \sum_{s \in k_g} \phi(\tau \mid \hat{\tau}_s) \left[ \sum_{s=1}^{N} \phi(\tau \mid \hat{\tau}_s) \right]^{-1}$ ,

where $s$ (= 1, 2, ..., N) indicates an individual examinee, and $\phi(\tau \mid \hat{\tau}_s)$ denotes the conditional density
of $\tau$, given $\hat{\tau}_s$. This conditional density is estimated by using the estimated conditional moments of
$\tau$, given $\hat{\tau}_s$, using one of the four methods, as was described in Section 2.3.

In the Weighted Sum Procedure of the Conditional P.D.F. Approach, we have for the estimated
operating characteristic of $k_g$

(6.2)    $\hat{P}_{k_g}(\theta) = \hat{P}^*_{k_g}[\tau(\theta)] = \sum_{s \in k_g} w(\hat{\tau}_s) \, \phi(\tau \mid \hat{\tau}_s) \left[ \sum_{s=1}^{N} w(\hat{\tau}_s) \, \phi(\tau \mid \hat{\tau}_s) \right]^{-1}$ ,

where $w(\hat{\tau}_s)$ is the weight function of $\hat{\tau}_s$. When we combine one of these two approaches with
the Normal Approach Method, $\phi(\tau \mid \hat{\tau}_s)$ in (6.1) or in (6.2) is approximated by the normal density
function, using the first two estimated conditional moments of $\tau$, given $\hat{\tau}_s$, which are given by (2.13)
and (2.14), respectively, as its parameters, $\mu_{\tau \mid \hat{\tau}_s}$ and $\sigma_{\tau \mid \hat{\tau}_s}$, in the formula

(6.3)    $\phi(\tau \mid \hat{\tau}_s) = [2\pi]^{-1/2} \, [\sigma_{\tau \mid \hat{\tau}_s}]^{-1} \exp\left[ -(\tau - \mu_{\tau \mid \hat{\tau}_s})^2 / \{2 \sigma^2_{\tau \mid \hat{\tau}_s}\} \right]$ .
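
The combination of (6.1) and (6.3) can be sketched in Python as follows. This is a minimal illustration, not the package program. The conditional means and standard deviations, which the text obtains from (2.13) and (2.14), are simply passed in as inputs, and the six examinees in the usage example are invented data. The function and argument names are hypothetical.

```python
import math


def normal_density(tau, mu, sigma):
    """Formula (6.3): normal approximation of the conditional density of tau."""
    return math.exp(-((tau - mu) ** 2) / (2.0 * sigma ** 2)) / (
        math.sqrt(2.0 * math.pi) * sigma)


def simple_sum_estimate(tau, item_scores, cond_means, cond_sds, target_score):
    """Formula (6.1) at a single point tau.

    item_scores[s] is the score of examinee s on the unknown item;
    cond_means[s] and cond_sds[s] are that examinee's estimated conditional
    mean and standard deviation of tau (taken as given here).
    """
    numerator = 0.0
    denominator = 0.0
    for score, mu, sigma in zip(item_scores, cond_means, cond_sds):
        density = normal_density(tau, mu, sigma)
        denominator += density          # sum over all N examinees
        if score == target_score:
            numerator += density        # sum over examinees with this item score
    return numerator / denominator


if __name__ == "__main__":
    # Invented data: three examinees failed the item, three passed it.
    scores = [0, 0, 0, 1, 1, 1]
    means = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
    sds = [0.5] * 6
    for tau in (-1.0, 0.0, 1.0):
        print(tau, simple_sum_estimate(tau, scores, means, sds, target_score=1))
```

At each value of $\tau$ the estimates for the separate item scores sum to unity, since every examinee contributes to the denominator and to exactly one numerator.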



[VI.2] Differential Weight Procedure 

If we accept the approximation of the conditional distribution of $\hat\tau$, given $\tau$, by the asymptotic
normality, as we do in these approaches (cf. Samejima, 1981), the other conditional distribution, i.e.,
that of $\tau$, given $\hat\tau$, will become more or less incidental. Thus in the Bivariate P.D.F. Approach
the bivariate distribution of $\tau$ and $\hat\tau$ is approximated for each separate item score subpopulation of
subjects of each unknown test item. In the Conditional P.D.F. Approach, however, the incidentality
of this second conditional distribution is not rigorously considered, and the implicit assumption exists
such that for the fixed value of $\hat\tau$ the conditional distributions of $\tau$ are similar for the different item
score subpopulations.






This assumption is not acceptable, however, since we have

(6.4)   $\phi_{k_g}(\tau \mid \hat\tau) = \psi(\hat\tau \mid \tau)\, f_{k_g}(\tau)\, [g_{k_g}(\hat\tau)]^{-1} ,$

where $f_{k_g}(\tau)$ indicates the density of $\tau$ for the subpopulation of subjects who share the
common item score $k_g$ of item $g$, $\psi(\hat\tau \mid \tau)$ is the conditional density of $\hat\tau$, given $\tau$, which is approximated
by the normal density, and $g_{k_g}(\hat\tau)$ is the marginal density of $\hat\tau$ for this subpopulation,
for which we have

(6.5)   $g_{k_g}(\hat\tau) = \int_{-\infty}^{\infty} \psi(\hat\tau \mid \tau)\, f_{k_g}(\tau)\, d\tau .$

We notice that there is a relationship

(6.6)   $f_{k_g}(\tau) = f(\tau)\, P^{*}_{k_g}(\tau) \left[ \int_{-\infty}^{\infty} f(\tau)\, P^{*}_{k_g}(\tau)\, d\tau \right]^{-1} ,$

where $f(\tau)$ denotes the density of $\tau$ for the total population. Since we have

(6.7)   $\phi(\tau \mid \hat\tau) = f(\tau)\, \psi(\hat\tau \mid \tau)\, [g(\hat\tau)]^{-1} ,$

where $g(\hat\tau)$ is the density of $\hat\tau$ for the total population of subjects, which is given by

(6.8)   $g(\hat\tau) = \int_{-\infty}^{\infty} \psi(\hat\tau \mid \tau)\, f(\tau)\, d\tau ,$

from the above formulae we obtain

(6.9)   $\phi_{k_g}(\tau \mid \hat\tau) = \phi(\tau \mid \hat\tau)\, P^{*}_{k_g}(\tau)\, h(\hat\tau) ,$

where $h(\hat\tau)$ is a function of $\hat\tau$ and constant for a fixed value of $\hat\tau$. Thus $\phi_{k_g}(\tau \mid \hat\tau)$ is a density
function proportional to $\phi(\tau \mid \hat\tau)\, P^{*}_{k_g}(\tau)$. We notice that $\phi(\tau \mid \hat\tau)$ in this formula is common to all the
item scores and across different unknown items, while $P^{*}_{k_g}(\tau)$ is a specific function of $\tau$ for each $k_g$.
Since $\phi(\tau \mid \hat\tau)$ can be estimated by one of the four methods described in Section 2.3, our effort should
be focused on finding an appropriate differential weight function for each $k_g$. Let $W_{k_g}(\tau)$ denote such
a differential weight function, which replaces $P^{*}_{k_g}(\tau)\, h(\hat\tau)$ in (6.9). Thus we can revise (6.1) and (6.2)
into the forms
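(6.10)  $\hat{P}_{k_g}(\theta) = \hat{P}^{*}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} W_{k_g}(\tau)\, \hat\phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} W_{k_g}(\tau; s)\, \hat\phi(\tau \mid \hat\tau_s) \right]^{-1}$

and

(6.11)  $\hat{P}_{k_g}(\theta) = \hat{P}^{*}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} w(\hat\tau_s)\, W_{k_g}(\tau)\, \hat\phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} w(\hat\tau_s)\, W_{k_g}(\tau; s)\, \hat\phi(\tau \mid \hat\tau_s) \right]^{-1} .$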

Since the differential weight function $W_{k_g}(\tau)$ involves $P^{*}_{k_g}(\tau)$, which itself is the target of estimation,
we may use its estimate, $\hat{P}^{*}_{k_g}(\tau)$, obtained by the Simple Sum Procedure or by the Weighted Sum
Procedure, as its substitute. In so doing, we may need some local smoothings of $\hat{P}^{*}_{k_g}(\tau)$ where the
estimation involves substantial amounts of error because of locally small numbers of subjects in the base
data, etc. In some cases we may need several iterations by renewing the differential weight functions
on each stage until the resulting estimated operating characteristic converges.
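
The renewal of the differential weight functions can be pictured with the following minimal sketch. It rests on several assumptions that are not part of the procedure as actually programmed: the differential weight of each item score is taken to be simply the previous estimate of its operating characteristic, each examinee enters the denominator with the weight of his or her own item score, no local smoothing is applied, and the number of passes is fixed in advance instead of being decided by a convergence check. All names and toy values are hypothetical.

```python
import math


def normal_pdf(tau, mean, sd):
    """Normal approximation to the conditional density of tau."""
    z = (tau - mean) / sd
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sd)


def differential_weight_estimates(grid, cond_means, cond_sds, item_scores, n_iter=2):
    """Return {item score: estimates on grid} after n_iter weight renewals."""
    categories = sorted(set(item_scores))
    # Pass 0 uses W = 1 for every item score, i.e. the Simple Sum Procedure.
    weights = {k: [1.0] * len(grid) for k in categories}
    for _ in range(n_iter + 1):
        estimates = {k: [] for k in categories}
        for i, tau in enumerate(grid):
            sums = {k: 0.0 for k in categories}
            for mean, sd, score in zip(cond_means, cond_sds, item_scores):
                # Assumption: each examinee is weighted by W of his own item score.
                sums[score] += weights[score][i] * normal_pdf(tau, mean, sd)
            total = sum(sums.values())
            for k in categories:
                estimates[k].append(sums[k] / total)
        weights = estimates  # renew the differential weights for the next pass
    return estimates


# Hypothetical toy data: five examinees with binary item scores.
grid = [-2.0 + 0.5 * i for i in range(9)]
cond_means = [-1.2, -0.4, 0.1, 0.7, 1.5]
cond_sds = [0.5, 0.4, 0.4, 0.4, 0.5]
item_scores = [0, 0, 1, 1, 1]
estimates = differential_weight_estimates(grid, cond_means, cond_sds, item_scores)
```

With `n_iter=0` the sketch returns the Simple Sum estimates themselves, so the effect of each renewal can be inspected by comparing successive values of `n_iter`.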

[VI.3] Examples 

We have tried this proposed method on the simulated data provided by Dr. Charles Davis of the
Office of Naval Research, using the Simple Sum Procedure of the Conditional P.D.F. Approach combined
with the Normal Approach Method with some modifications as the initial estimate of $P^{*}_{k_g}(\tau)$ in the
differential weight function. These data are simulated on-line item calibration data of the initial itempool
calibration based upon conventional testing, in which 100 dichotomous items are divided into four
subtests of 25 items each, and each subtest has been administered to 6,000 hypothetical examinees,
and those of different rounds based upon adaptive testing, in which each of the 50 new binary items
has been administered to a subgroup of 1,500 hypothetical subjects out of the total of 15,000. These
hypothetical examinees' ability distributes unimodally within the interval of $\theta$, (-3.0, 3.0), with slight
negative skewness.

For the purpose of illustration, Figure 6-1 presents the results of the Differential Weight Procedure
using the results of the Simple Sum Procedure of the Conditional P.D.F. Approach combined with the
Normal Approach Method with some modifications as the initial estimates, for a couple of items of
the initial itempool. They are dichotomous items, and were intentionally selected from those items
whose true operating characteristics of the correct answer are non-monotonic, in order to visualise the
benefit of the nonparametric estimation of the operating characteristic. In each graph, also presented
for comparison is the best fitted operating characteristic of the correct answer following the
three-parameter logistic model, which has been given by Dr. Michael Levine. We can see in these graphs that
the resulting estimated operating characteristics are fairly close to the true ones, and that they reflect
the non-monotonicities. The reader is directed to ONR/RR-90-4 (Samejima, 1990) for more examples.

FIGURE 6-1

Two Examples of the Estimated Operating Characteristic of the Correct Answer
Using the Differential Weight Procedure (Dotted Line), in Comparison with
the True Operating Characteristic (Solid Line) and the Best Fitted
Three-Parameter Logistic Curve (Dashed Line).

[VI.4] Sensitivities to Irregularities of Weight Functions

As we have proceeded, several factors have been identified and observed which affect the resulting
estimated operating characteristics substantially. They are concerned with the differential weight
function, and can be itemised as: 1) lower end ambiguities, 2) upper end ambiguities, 3) local irregularities
and 4) overall irregularities.



Out of these factors, lower and upper end ambiguities basically come from the fact that we do not
usually have sufficiently large numbers of subjects on the lowest and the highest ends of the interval
of $\theta$ of interest upon which the estimation of the operating characteristics is made. Also the fact that
the test information function $I(\theta)$ is used in the transformation of $\theta$ to $\tau$ which is specified by
(2.9) may have something to do with these ambiguities. It has been observed (Samejima, 1979b) that
in using equivalent items following the Constant Information Model (Samejima, 1979a) the speed of
convergence of the conditional distribution of the maximum likelihood estimate $\hat\theta$, given $\theta$, to the
asymptotic normality with $\theta$ and $[I(\theta)]^{-1/2}$ as its two parameters substantially differs for different
levels of $\theta$, in spite of the fact that the amount of test information is constant for every level of $\theta$.
To be more specific, the convergence is observed to be much slower at those levels which are close to
either end of the interval of $\theta$ for which the amount of test information is non-zero and constant, and
faster at intermediate levels of $\theta$. This situation can be ameliorated if we replace the test information
function $I(\theta)$ in (2.9) by one of its two modified forms (cf. Chapter 3), $\Upsilon(\theta)$ and $\Xi(\theta)$.

By irregularity we mean non-smoothness, which is exemplified by an unnatural angle, etc. It has
been observed that for most items the resulting operating characteristic is amazingly sensitive to these
irregularities of the differential weight function. In order to observe these sensitivities, Figure 6-2
illustrates how these irregularities, which are involved in the differential weight function, affect the
resulting estimated operating characteristic. For more examples, the reader is directed to
ONR/RR-90-4 (Samejima, 1990).

The effect of local irregularities is most interesting to observe in the three examples presented by 
Figure 6-2. In each of these graphs, the artificially irregular differential weight function for the correct 
answer is drawn by a short dashed line, and, in order to emphasise its irregularities, it was proportionally 
enlarged and shown by a long dashed line. We can see in each graph that, when the differential weight 
function has an unnatural angle, for example, the resulting estimated operating characteristic of the 
correct answer also shows an unnatural angle at approximately the same level of $\theta$. We can also see in
these graphs how overall irregularities of the differential weight function affect the resulting estimated 
operating characteristic, and how sensitive the latter is to the former. This type of sensitivity of the 
resulting estimated operating characteristic to the irregularities of the differential weight function is 
encouraging as well as threatening, for it promises success in the estimation provided that we succeed 
in finding the right differential weight function. 

During the present research period, the author and her research assistants have perhaps spent the
greatest amount of time on developing this method, the Differential Weight Procedure of the Conditional
P.D.F. Approach. Thus, in addition to the results exemplified in this section and in ONR/RR-90-4
(Samejima, 1990), many other results have been produced, using different strategies in
specifying differential weight functions, etc. The research will be continued in the future, and those
results which are not introduced in this final report will be included in the basis upon which the future
research will be founded and planned, and will eventually be introduced in future research reports.

[VI.5] Discussion and Conclusions

A new procedure of nonparametric estimation of the operating characteristics of discrete item
responses has been proposed, which is called the Differential Weight Procedure of the Conditional P.D.F.
Approach. Some examples have been given, and sensitivities of the resulting estimated operating
characteristics to irregularities of the differential weight functions have been observed and discussed. Those
outcomes suggest the importance of further investigation of the weight function in the future.

To summarise, although the Simple Sum Procedure of the Conditional P.D.F. Approach combined with
the Normal Approach Method works reasonably well for the on-line item calibration of adaptive testing,
and also for the paper-and-pencil testing, especially when the number of subjects is large, if we wish
to increase the accuracy of estimation we can use the Differential Weight Procedure. The disadvantage
will be the added CPU time, so we need to consider the balance of the cost and accuracy of estimation
before we make our decision. It will be less expensive, however, if we compare the CPU time required
for the present procedure with the time required for the Bivariate P.D.F. Approach.

FIGURE 6-2

Three Examples of the Estimated Operating Characteristic of the Correct Answer
Using the Differential Weight Procedure (Dotted Line), in Comparison with the
True Operating Characteristic (Solid Line), When the Differential Weight
Function (Short Dashed Line) Has Irregularities. The Function Was Also
Proportionally Enlarged and Plotted (Long Dashed Line) to Visualise
the Angles and Other Irregularities Well.

VII Content-Based Observation of Informative Distractors
and Efficiency of Ability Estimation

Partly because of the availability of computer software, such as Logist (Wingersky, Barton and Lord,
1982), Bilog (Bock and Aitkin, 1981), etc., it is a common procedure among researchers to mold
the operating characteristics of correct answers into the three-parameter logistic model, ignoring their
possible non-monotonicity. In some cases, strategies are even taken so that distractors, which cause
the non-monotonicity, are considered as undesirable and are replaced by some other non-threatening
alternative answers.

A question must be raised as to whether this strategy is wise. In this chapter, this issue will be
discussed both from theory and from practice, and a new strategy of writing test items, which leads
to more efficient ability estimation, will be proposed. It will take advantage of the ease in handling
mathematics attributed to parameterisation, and yet minimise the effect of noise caused by random
guessing.



[VII.1] Non-Monotonicity of the Conditional Probability of the Positive
Response, Given Latent Variable 

This section deals basically with the essence or a summary of the paper published by the author 
more than twenty years ago (Samejima, 1968), as one of the research reports of the L. L. Thurstone 
Psychometric Laboratory of the University of North Carolina. The content of the paper was a protocol 
which led to the proposal of a new family of models for the multiple-choice test item (Samejima,
1979b). The author believes that this paper published in 1968 still gives new ideas to today's research
communities. 

The paper is concerned with the nominal response, and also multiple-choice situations, in which 
examinees are required to choose one of the given alternatives, in connection with the graded response 
model (cf. Samejima, 1969, 1972). For a multiple-choice item a certain number of false answers are given 
in addition to the correct answer. In a general case it is impossible to score them in a graded manner in 
accordance with their degrees of attainment toward the goal. Thus the multiple-choice situation should 
be treated as a special instance of the nominal level of response, although, in addition, the problem of 
random or irrational choice should be investigated. 

Confining discussions to examinees who have responded to item g incorrectly, there can be diversity 
of false answers if they have responded to it freely, without being forced to choose one of a set of 
alternative answers. It is conceivable that some of the false answers may require high levels of ability 
measured while some others may not, some may be related to the ability measured strongly while some 
others may not, etc. An objective measure of the plausibility of a specified false answer is its operating
characteristic, i.e., the probability of its occurrence defined for a fixed value of ability $\theta$, and, therefore,
expressed as a function of $\theta$.

Let $M_s(\theta)$ be a sequence of the conditional probabilities corresponding to the cognitive subprocesses
required in finding the plausibility of response $k_g$ to item $g$, and $U_{k_g}(\theta)$ be the conditional probability
that an examinee discovers the irrationality of response $k_g$ as the answer to item $g$, on condition
that he has already found out its plausibility. The operating characteristic of $k_g$, which is denoted by
$P_{k_g}(\theta)$, can be expressed by

(7.1)   $P_{k_g}(\theta) = [1 - U_{k_g}(\theta)] \prod_{s \in k_g} M_s(\theta) ,$

since it is reasonably assumed that an examinee who gives a response $k_g$ to item $g$ is one who has
succeeded in finding the plausibility of $k_g$, and yet failed in finding its irrationality. We notice that this
formula is exactly the same in its structure as the definition of $P_{x_g}(\theta)$ on the graded response level,
where $M_s(\theta)$ is replaced by $M_u(\theta)$ and $U_{k_g}(\theta)$ is replaced by $M_{(x_g+1)}(\theta)$ (cf. Samejima, 1972).
Defining $M_{k_g}(\theta)$ such that

(7.2)   $M_{k_g}(\theta) = \prod_{s \in k_g} M_s(\theta) ,$

we can rewrite (7.1) into

(7.3)   $P_{k_g}(\theta) = M_{k_g}(\theta)\, [1 - U_{k_g}(\theta)] .$

It will reasonably be assumed from their definitions that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ be strictly
increasing in $\theta$, provided that a specified response $k_g$ is a good mistake in the sense that the
discoveries of its plausibility and irrationality are properly related with ability $\theta$. It will also be
reasonably assumed that the upper asymptotes of $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are unity, and the lower
asymptote of $M_{k_g}(\theta)$ is zero.

We assume that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are three-times-differentiable with respect to $\theta$. It is
easily observed that, in order to satisfy the unique maximum condition (Samejima, 1969, 1972), $P_{k_g}(\theta)$
defined by (7.3) must fulfill the following inequalities:

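(7.4)   $\frac{\partial^{2}}{\partial\theta^{2}} \log M_{k_g}(\theta) = \frac{\partial}{\partial\theta} \left[ \frac{\partial}{\partial\theta} M_{k_g}(\theta)\, \{ M_{k_g}(\theta) \}^{-1} \right] < 0$

and

(7.5)   $\frac{\partial^{2}}{\partial\theta^{2}} \log [1 - U_{k_g}(\theta)] = \frac{\partial}{\partial\theta} \left[ -\frac{\partial}{\partial\theta} U_{k_g}(\theta)\, \{ 1 - U_{k_g}(\theta) \}^{-1} \right] < 0 .$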

(For proof, see Samejima, 1968.) Note that in this case the lower asymptote of $U_{k_g}(\theta)$ need not be
zero. The operating characteristic of a specified response $k_g$ which satisfies the unique maximum
condition was called the plausibility curve (Samejima, 1968), and later the plausibility function (cf.
Samejima, 1984a). As the condition suggests, the plausibility curve is necessarily unimodal. A
schematised hypothesis for the plausibility curve is the following. The probability that an examinee will find
the plausibility, but will fail in discovering the irrationality, of a specified response $k_g$ as the answer
to item $g$ is a function of ability $\theta$; it increases as ability $\theta$ increases, reaches maximum at a certain
value of $\theta$, and then decreases afterwards. If an item provides many such responses, their plausibility
curves will be powerful sources of information in estimating examinees' abilities. That is to say, we can
make use of specific wrong answers to an item as sources of information, as well as the correct answer.
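
To make the shape of a plausibility curve concrete, the short sketch below evaluates (7.3) for one hypothetical choice of its two factors. The logistic forms of $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$, their parameter values and the scaling constant 1.7 are assumptions made only for this illustration; logistic functions are used because they satisfy (7.4) and (7.5).

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def plausibility(theta, b_m=-1.0, b_u=1.0, a=1.0, D=1.7):
    """P_k(theta) = M_k(theta) * [1 - U_k(theta)], as in (7.3).

    M_k: probability of finding the plausibility of the response
         (hypothetical logistic function located at b_m).
    U_k: probability of discovering its irrationality, given that the
         plausibility was found (hypothetical logistic located at b_u).
    """
    m = logistic(D * a * (theta - b_m))
    u = logistic(D * a * (theta - b_u))
    return m * (1.0 - u)


thetas = [-4.0 + 0.05 * i for i in range(161)]
curve = [plausibility(t) for t in thetas]
peak_index = max(range(len(curve)), key=curve.__getitem__)
peak_theta = thetas[peak_index]
```

With these values the curve rises up to a single maximum near $\theta = 0$ and falls afterwards, which is the unimodal shape described above.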

Let $P_g(\theta)$ denote the operating characteristic of the correct answer of a dichotomous item $g$ in
the free-response situation. Let $P^{*}_g(\theta)$ be the same function, but in the multiple-choice situation. The
conventional three-parameter model is represented by

(7.6)   $P^{*}_g(\theta) = c_g + (1 - c_g)\, P_g(\theta) ,$

where $c_g$ is the probability with which an examinee will guess correctly (Lord and Novick, 1968).
This is a monotonically increasing function of $\theta$ with $c_g$ (> 0) and unity as its lower and upper
asymptotes, provided that $P_g(\theta)$ is strictly increasing in $\theta$ with zero and unity as its lower and upper
asymptotes.

(7.9)   $P^{*}_g(\theta) = d_g c_g + (1 - d_g)\, P_g(\theta) .$

This is somewhat similar to formula (7.6), the conventional functional formula for the operating
characteristic of the correct answer of a multiple-choice item. The lower asymptote of the present
function is $d_g c_g$ ($< c_g$), however, while it is $c_g$ in (7.6); the upper asymptote of the present function is
$[1 - d_g(1 - c_g)]$, which can be less than unity, while it is unity in (7.6). In a special case where $d_g = 0$,
that is, an examinee tries to solve item $g$ by proper reasoning with probability one, (7.9) reduces
to $P_g(\theta)$, the operating characteristic of the correct answer in the free-response situation. In another
special case where $d_g = 1$, that is, an examinee depends upon random guessing with probability one,
(7.9) reduces to a constant, $c_g$. In the more general case where $d_g(\theta)$ varies as $\theta$ varies, it is observed
from (7.8) that

(7.10)  $P^{*}_g(\theta) = c_g = P_g(\theta)$   if $\theta = \theta_0$ ,
        $c_g < P^{*}_g(\theta) < P_g(\theta) < 1$   if $\theta > \theta_0$ ,

where

(7.11)  $\theta_0 = P_g^{-1}(c_g) ,$

provided that $c_g$ is greater than zero. This result is quite natural, since it is reasonably assumed that
the probability of success in solving item $g$ will decrease by random guessing if the one attained by
the due cognitive process is higher than the one attained by random guessing, and it will increase by
random guessing if the latter probability is higher than the former. If we assume that the asymptotes
of $d_g(\theta)$ in negative and positive directions be unity and zero, respectively, we will obtain $c_g$ and
unity as the lower and upper asymptotes of $P^{*}_g(\theta)$. Figure 7-1 presents two examples of the operating
characteristic given by (7.8) where $c_g$ is 0.2, using two different $d_g(\theta)$'s. Note that there is a
dip on the lower part of the curves for $P^{*}_g(\theta)$. These two $d_g(\theta)$'s are identical for the lower levels of
$\theta$, but differ on the upper levels, with the upper asymptotes 0.0 and 0.1, respectively. In these
examples, therefore, the upper asymptote of $P^{*}_g(\theta)$ is unity in the first example, and 0.92 in the
second, i.e., the conditional probability for the correct answer never approaches unity however high the
ability may be.
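
The dip and the lowered upper asymptote can be reproduced numerically with the sketch below. It assumes that the model takes the form $P^{*}_g(\theta) = [1 - d_g(\theta)] P_g(\theta) + d_g(\theta) c_g$, which is the form whose derivative is (7.12); the logistic shapes chosen for $P_g(\theta)$ and $d_g(\theta)$, their parameters and the scaling constant 1.7 are hypothetical and are not the functions used for Figure 7-1. Only $c_g = 0.2$ and the upper asymptotes 0.0 and 0.1 of $d_g(\theta)$ are taken from the text.

```python
import math

D = 1.7      # assumed scaling constant
C_G = 0.2    # probability of guessing correctly, as in the text


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_free(theta):
    """Hypothetical free-response operating characteristic P_g(theta)."""
    return logistic(D * theta)


def d_guess(theta, upper=0.0):
    """Hypothetical probability of random guessing d_g(theta).

    It tends to 1 as theta decreases and to `upper` as theta increases.
    """
    return upper + (1.0 - upper) * logistic(-2.0 * D * (theta + 1.5))


def p_multiple_choice(theta, upper=0.0):
    """P*_g(theta) = [1 - d_g(theta)] P_g(theta) + d_g(theta) c_g."""
    d = d_guess(theta, upper)
    return (1.0 - d) * p_free(theta) + d * C_G


theta_0 = math.log(C_G / (1.0 - C_G)) / D   # P_g(theta_0) = c_g, cf. (7.11)
grid = [-4.0 + 0.05 * i for i in range(161)]
dip = min(p_multiple_choice(t, upper=0.1) for t in grid)
```

In this sketch the curve starts near $c_g$ = 0.2, dips below it for $\theta < \theta_0$, passes through $c_g$ at $\theta_0$, and approaches 1 - 0.1(1 - 0.2) = 0.92 when the upper asymptote of $d_g(\theta)$ is 0.1.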

If $d_g(\theta)$ is differentiable, $P^{*}_g(\theta)$ is also differentiable, and from (7.8) we have

(7.12)  $\frac{\partial}{\partial\theta} P^{*}_g(\theta) = [1 - d_g(\theta)]\, \frac{\partial}{\partial\theta} P_g(\theta) + [c_g - P_g(\theta)]\, \frac{\partial}{\partial\theta} d_g(\theta) .$

Thus it is obvious that $P^{*}_g(\theta)$ is strictly increasing in $\theta$ for the range $\theta > \theta_0$, if, and only if, $d_g(\theta)$
is less than unity for the range of $\theta$ satisfying $\theta > \theta_0$. Thus in this case $P^{*}_g(\theta)$ is non-decreasing in
$\theta$ throughout its whole range. In general, $P^{*}_g(\theta)$ equals $c_g$ and presents a horizontal line as far as
$d_g(\theta)$ is unity, and then increases for the rest of the range as $\theta$ increases.

As for the range expressed by $\theta < \theta_0$, $P^{*}_g(\theta)$ equals $c_g$ regardless of the value of $P_g(\theta)$ for the
values of $\theta$ for which $d_g(\theta)$ is unity, and is some positive value less than $c_g$ otherwise. If $d_g(\theta)$ is
unity throughout this range of $\theta$, $P^{*}_g(\theta)$ presents a horizontal line for this range. If $d_g(\theta)$ is unity
for the negative extreme value of $\theta$, but $d_g(\theta)$ takes on some values less than unity for a subset of
$\theta$ of this range, $P^{*}_g(\theta)$ has at least one local minimum. If $d_g(\theta)$ is less than unity for the negative
extreme value of $\theta$, $P^{*}_g(\theta)$ can be strictly increasing in $\theta$, non-decreasing, or have one or more local
minima, in accordance with the functional formula for $d_g(\theta)$.



It is obvious that any operating characteristic having local minima does not satisfy the unique
maximum condition (Samejima, 1969, 1972), and neither does the one whose first derivative equals zero
at some value of $\theta$. In the case of $P^{*}_g(\theta)$ defined by (7.8) we can prove that, in general, it does not
satisfy the unique maximum condition, even if it is strictly increasing in $\theta$. (For proof, see Samejima,
1968.)



Two characteristics of the model represented by (7.8) are that it allows dips, and also a smaller value
than unity for the upper asymptote of the operating characteristic of the correct answer, as Figure 7-1
illustrates. In these examples, there is only one dip on the lower level of $\theta$. There can be more than
one, however, and an example is presented elsewhere (Samejima, 1968). In many cases the model may
describe the real operating characteristic of the correct answer more closely than the three-parameter
model.

It has been reported by several researchers that they have come across estimated operating
characteristics of correct answers that do not converge to unity, but to some other value, less than unity. Note
that the general model described above can handle such situations, although most of the other models
proposed by different researchers so far cannot.

We notice that neither (7.6) nor (7.8) explicitly takes into consideration the influences of separate
distractors. Suppose an examinee A has chosen to solve item $g$ by reasoning, i.e., without guessing, and
has reached an answer which is not correct. Suppose, further, that this specified response is not given
as an alternative answer to this item. Then either he will decide to give an answer by guessing, or he
will try to solve the item by reasoning all over again. To account for these possibilities, we would have
to give practically all the different plausible responses to item $g$ as its alternatives, which is practically
impossible, since the number of alternative answers is more or less restricted. In contrast to this, it
is interesting to note that the psychological hypothesis behind the three-parameter logistic model may
be more realistic in the case where no very plausible responses except for the correct answer to item
$g$ are given as its alternative answers. Thus, even if an examinee has reached a specified plausible
response other than the correct answer, he may turn to random guessing simply because he cannot find
that specified answer among the alternatives. Such a situation has another serious problem, however,
since it is likely for an examinee who is highly alternative-oriented to choose the correct answer without
much reasoning or guessing, simply because the other alternatives are too ridiculous to be the answer
to the item. As a result, the operating characteristic of the correct answer may be deformed so that
it has a lower difficulty and less discriminating power. Plausible answers as distractors are necessary as
alternatives in order not to destroy the nature of the item.

It is conceivable that the plausibilities of the alternatives attached to item g other than the correct 
answer will be one of the factors affecting the probability of random guessing in the multiple-choice 
situation. For this reason, here we shall suppose that an examinee will try to solve the item following 
proper cognitive processes at the beginning, and only in the case where he has reached an answer which 
is not given as an alternative, or where he has failed to find any answer at all, he will guess. 

Let $k_g$ or $h_g$ denote a specified response to item $g$ which is given as an alternative, including the
correct answer, and $P_{k_g}(\theta)$ or $P_{h_g}(\theta)$ be its operating characteristic in the free-response situation.
It may reasonably be assumed that $\sum_{k_g} P_{k_g}(\theta)$ is less than or equal to unity for any fixed value of
$\theta$. Let $P^{*}_{k_g}(\theta)$ or $P^{*}_{h_g}(\theta)$ denote the operating characteristic of a specified alternative $k_g$ or $h_g$ in
the multiple-choice situation, and $c_{k_g}$ or $c_{h_g}$ be the probability of choosing $k_g$ or $h_g$ by guessing,
which satisfies

(7.13)  $\sum_{k_g} c_{k_g} = 1 .$

Thus we can write

(7.14)  $P^{*}_{k_g}(\theta) = P_{k_g}(\theta) + \left[ 1 - \sum_{h_g} P_{h_g}(\theta) \right] c_{k_g}$

for any $k_g$, and, by using the notation for the correct answer as we did in the previous sections, we
obtain

(7.15)  $P^{*}_g(\theta) = P_g(\theta) + \left[ 1 - \sum_{h_g} P_{h_g}(\theta) \right] c_g .$

It is worth noting that we have specified not only the operating characteristic of the correct answer 
in the multiple-choice situation, but also of each distractor. The utility of the operating characteristic 
of each wrong alternative answer in the estimation of an examinee's ability, as well as the one of the
correct response, is suggested, and this is a feature of the present discussion.
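
A small numerical sketch of (7.14) may help. The function name and the probabilities used below are hypothetical; the only requirements taken from the text are that the free-response probabilities of the given alternatives sum to at most unity at the fixed value of $\theta$, and that the guessing probabilities sum to unity as in (7.13).

```python
def multiple_choice_ocs(free_probs, guess_probs):
    """Apply (7.14) at one fixed value of theta.

    free_probs  : P_k(theta) for each alternative in the free-response
                  situation (the correct answer included).
    guess_probs : c_k, the probability of choosing alternative k by
                  guessing; these must sum to unity, as in (7.13).
    """
    if abs(sum(guess_probs) - 1.0) > 1e-9:
        raise ValueError("guessing probabilities must sum to unity")
    # Probability that reasoning led to no answer among the alternatives.
    residual = 1.0 - sum(free_probs)
    return [p + residual * c for p, c in zip(free_probs, guess_probs)]


# Hypothetical item with three alternatives; the first is the correct answer.
free_probs = [0.4, 0.2, 0.1]
guess_probs = [0.5, 0.25, 0.25]
mc_probs = multiple_choice_ocs(free_probs, guess_probs)
```

Since the residual probability is distributed over the alternatives in proportion to the guessing probabilities, the resulting multiple-choice probabilities always sum to unity.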

It has been made clear that, in general, $P^{*}_g(\theta)$ does not satisfy the unique maximum condition
regardless of the functional formulae for the plausibility curves of the distractors. As for the alternatives
other than the correct answer, it can easily be shown that, in general, $P^{*}_{k_g}(\theta)$ does not satisfy the
unique maximum condition (cf. Samejima, 1968, 1979b).

FIGURE 7-2

Operating Characteristic of the Correct Answer in the Free-Response Situation
(Solid Line) and in the Multiple-Choice Situation (Dashed Line), in the Case
Where Two Alternatives Are Given; Also the Operating Characteristic
of the Other Alternative in the Free-Response Situation (Solid Line) Is
Plotted from the Ceiling; $c_g = c_{k_g} = 0.5$ .

... the correct answer and one incorrect response, are given. In this example, $P_{k_g}(\theta)$ for the wrong answer
... with $a_g$ = 1/1 48 and $b_g$ = 0.36 is used as the operating characteristic of the correct answer and the ...
The value of $c_g$, as well as that of $c_{k_g}$ for the incorrect answer, is 0.5.

... that these are the philosophies ... the new family of models for the multiple-choice test item (Samejima,
1979b) ... with the idea of content-based observation of informative
distractors and strategies of writing test items, which will be proposed in a later section. The general ...
Model, to which the three-parameter model represented by (7.6) belongs (cf. Samejima, 1979b).

[VII.2] Effect of Noise in the Three-Parameter Logistic Model and the
Meanings of the Difficulty and Discrimination Parameters



It is still a common procedure among researchers to adopt the three-parameter logistic model, which
is represented by (3.11) in Section 3.2, for their multiple-choice test items and compare the resulting
estimated discrimination parameters, or the difficulty parameters, across different items. An important
fact that is overlooked is that this is not legitimate, for the addition of the third parameter $c_g$ makes
the other two item parameters lose their original meanings. If $a_g$ = 1.00 and $c_g$ = 0.25 in the
three-parameter logistic model, for example, this corresponds to $a_g$ = 0.75 in the logistic model in
the maximum discrimination power. If, in addition to these parameter values, $b_g$ = 0.00, then the
difficulty level for the three-parameter logistic model defined as the level of $\theta$ at which chances for
success are 0.5 is -0.4077336, i.e., substantially lower than 0.00.
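
The two figures quoted in this example can be checked with a few lines of code. The sketch assumes the usual form of the three-parameter logistic model with the scaling constant D = 1.7 (the value that reproduces -0.4077336), and obtains the difficulty level by solving $P(\theta) = 0.5$ for $\theta$; the function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def three_pl(theta, a, b, c):
    """Three-parameter logistic operating characteristic."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def actual_discrimination(a, c):
    """Maximum slope expressed as an equivalent logistic discrimination."""
    return (1.0 - c) * a


def actual_difficulty(a, b, c):
    """Level of theta at which the chance of success is 0.5."""
    if not 0.0 <= c < 0.5:
        raise ValueError("no such level exists unless 0 <= c < 0.5")
    return b + math.log(1.0 - 2.0 * c) / (D * a)


alpha = actual_discrimination(1.00, 0.25)
beta = actual_difficulty(1.00, 0.00, 0.25)
```

Here `alpha` is 0.75 and `beta` agrees with -0.4077336 to the digits printed above; substituting `beta` back into `three_pl` returns 0.5.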

In general, we can write

(7.17)  $\alpha_g = (1 - c_g)\, a_g ,$
        $\beta_g = b_g + (D a_g)^{-1} \log(1 - 2 c_g) ,$

where $\alpha_g$ denotes the actual discrimination power and $\beta_g$ is the actual difficulty level in the
three-parameter logistic model. The effect of $c_g$ can be
substantial, both on the discrimination power $\alpha_g$ and on the difficulty index $\beta_g$. Thus the simple
comparison of the values of $a_g$ for two or more test items having different values of the lower asymptote
$c_g$ is illegitimate and can be harmful, for the factor $(1 - c_g)$ may affect the value of $\alpha_g$, the real
discrimination power, substantially. As for the difficulty index, since the second term on the right hand
side of the second equation of (7.17) is always negative for $0 < c_g < 0.5$, this term represents the
amount of decrement of the difficulty level. Note that as $c_g$ tends to 0.5, $\beta_g$ approaches negative
infinity! (If $c_g \geq 0.5$ then $\beta_g$ does not even exist.) The illegitimacy of, and the danger in, comparing
$b_g$'s across two or more test items having different lower asymptotes $c_g$ is even more obvious for the
difficulty index.

It is obvious from theory that in both the logistic and the three-parameter logistic models the
derivative of the operating characteristic of the correct answer is highest at $\theta = b_g$. Actually, the
derivatives are: $D a_g / 4$ and $(1 - c_g) D a_g / 4$, respectively. The ratio of this maximal slope between the
three-parameter logistic model and the logistic model is $(1 - c_g)$, which equals 0.75 when $c_g$ = 0.25,
and is as low as 0.50 when $c_g$ = 0.50. The corresponding ratio between the three-parameter logistic
model and the normal ogive model is approximately $0.938687718\,(1 - c_g)$, which is a little less than ...

Figure 7-3 illustrates that several sets of substantially different parameter values in the
three-parameter logistic model can produce very similar operating characteristics of the correct answer. We
can tell that the differences in the values of the discrimination and difficulty parameters for these items
are substantial, and yet the resulting curves are very close to each other for a wide range of $\theta$. Simple
comparison of the two estimated discrimination parameters is illegitimate, therefore, when the estimated
guessing parameters prove to be different from each other, as is usually the case with actual data. Since
the estimation of the third parameter $c_g$ tends to be most inaccurate, this example indicates the
danger in direct comparisons of the estimated discrimination parameters, and also the estimated difficulty
parameters, across the items.

In most cases the estimated guessing parameter of a multiple-choice test item provides us with some
other value than the reciprocal of the number of the alternative answers. It is reported that in some cases
the estimated $c_g$ takes on quite high values (cf. Lord, 1980, Section 2.2). These phenomena suggest
that the philosophy behind the model is unrealistic. Researchers using the three-parameter logistic
model argue, however, that it still is a convenient approximation to real operating characteristics of
correct answers, because of its simplicity in mathematics. In a way it is true. The effective use of
the three-parameter model cannot be realised, however, unless we know the problems attributed to
the model, and use the model in such a way that these weaknesses will not cause too much noise and
inefficiency.

FIGURE 7-3

Examples of the Operating Characteristics of the Correct Answer in the Three-Parameter Logistic
Model (Dotted Lines), Together with the One in the Logistic Model with $a_g$ = 1.00 and
$b_g$ = -0.64 (Solid Line). The Parameters for the Four Functions in the Order of $a_g$, $b_g$
and $c_g$ are: 1.05, -0.52, 0.10; 1.10, -0.40, 0.20; 1.15, -0.27, 0.30; 1.20, -0.13, 0.40;
Respectively.

Investigation of the problems encountered when we apply the three-parameter logistic model to the 
data which actually follow the normal ogive model was made earlier (Samejima, 1984b). The data 
used in the study are simulated data for two samples of 500 and 2,000 hypothetical examinees, 
respectively, sampled from the uniform ability distribution for the interval of $\theta$ , $(-2.5, 2.5)$ . In order
to investigate the effect of the number of test items on the resultant estimated parameters obtained by 
Logist 5, we used: 1) Ten Item Test and 2) Thirty-Five Item Test, both of which consist of binary items 
following the normal ogive model. The response pattern for each hypothetical subject was produced 
by the Monte Carlo Method. Combining these two hypothetical tests, we observed the results of: 3) 
Forty-Five Item Test, and, in addition, we observed the results of rather artificially created: 4) Eighty 
Item Test (cf. Samejima, 1984b). 

These results suggest that there exists a substantial effect of the assumed third parameter, $c_g$ , on the
other two estimated item parameters, if the estimation is made by molding the operating characteristic 
of the correct answer into that of the three-parameter logistic model, when actually it follows the normal 
ogive model. This effect appears to be stronger on the estimated discrimination parameter than on the 
estimated difficulty parameter. In order to amend these enhancements, the discrimination shrinkage
factor and the difficulty reduction index were proposed (Samejima, 1984b) by formulae (7.19) and (7.21),
respectively.

(7.18) $a_g^* = \xi(c_g^*) \, a_g$ .

(7.19) $\xi(c_g^*) = -\log(1 - 2c_g^*) \, [\log(1 + c_g^*) - \log(1 - c_g^*)]^{-1}$ .

(7.20) $b_g^* = b_g + \delta(c_g^* \mid a_g)$ .

(7.21) $\delta(c_g^* \mid a_g) = (D a_g)^{-1} [\log(1 + c_g^*) - \log(1 - c_g^*)]$ .



In these formulae, $a_g^*$ , $b_g^*$ , and $c_g^*$ indicate the estimated item discrimination, difficulty and guessing
parameters when the three-parameter logistic model is assumed, respectively. Some resulting estimated
operating characteristics of the correct answer turned out to be disastrously different from the theoretical
functions, especially when only ten binary test items were included. We find no substantial
differences between the results of 500 Subject Case and 2,000 Subject Case, indicating that increasing 
the number of subjects from 500 to 2,000 does not provide us with a substantial gain. 
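
The two correction indices can be evaluated directly. The sketch below assumes that formulae (7.19) and (7.21) read as $-\log(1 - 2c_g^*)\,[\log(1 + c_g^*) - \log(1 - c_g^*)]^{-1}$ and $(D a_g)^{-1}[\log(1 + c_g^*) - \log(1 - c_g^*)]$ , respectively, and that D = 1.7; the function names are hypothetical and are not taken from Samejima (1984b).

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def shrinkage_factor(c_star):
    """Discrimination shrinkage factor as read from (7.19); needs 0 < c_star < 0.5."""
    return -math.log(1.0 - 2.0 * c_star) / (math.log(1.0 + c_star) - math.log(1.0 - c_star))


def difficulty_reduction(c_star, a):
    """Difficulty reduction index as read from (7.21)."""
    return (math.log(1.0 + c_star) - math.log(1.0 - c_star)) / (D * a)


def implied_3pl_estimates(a, b, c_star):
    """Pair (a*, b*) implied by (7.18) and (7.20) for an item with parameters a and b."""
    return shrinkage_factor(c_star) * a, b + difficulty_reduction(c_star, a)


a_star, b_star = implied_3pl_estimates(a=1.0, b=0.0, c_star=0.20)
print(round(a_star, 4), round(b_star, 4))
```

Under this reading both indices grow with the assumed third parameter, so a larger estimated guessing parameter goes together with a larger estimated discrimination parameter and a larger estimated difficulty parameter.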

It has been pointed out that the three-parameter logistic model does not satisfy the unique maximum 
condition for the likelihood function, and this topic has been thoroughly discussed (Samejima, 1973). 
The expected loss of item information for a fixed value of $\theta$ is given by

(7.22) $I_g(\theta) - I_g^*(\theta) = c_g D^2 a_g^2 \, \Psi_g(\theta) \{1 - \Psi_g(\theta)\} \, [c_g + (1 - c_g) \Psi_g(\theta)]^{-1}$ ,

where

(7.23) $\Psi_g(\theta) = [1 + \exp\{-D a_g (\theta - b_g)\}]^{-1}$ ,

and $I_g(\theta)$ and $I_g^*(\theta)$ are the item information functions in the logistic and the three-parameter logistic
models, respectively. We have for the critical value $\underline{\theta}_g$ , below which the information provided by the
correct answer to the item following the three-parameter logistic model assumes negative values,

(7.24) $\underline{\theta}_g = b_g + (2 D a_g)^{-1} \log c_g$ ,

which is strictly increasing with the increase in the parameter value $c_g$ , and also in $a_g$ and in $b_g$ .
If, for example, $a_g = 1.00$ and $b_g = 0.00$ , $\underline{\theta}_g = -0.473364$ for $c_g = 0.20$ , and $\underline{\theta}_g = -0.407734$ for
$c_g = 0.25$ . They are considerably high values relative to $b_g$ .
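
The critical value can be illustrated with a short computation. This is a sketch under stated assumptions: (7.24) is read as $b_g + (2 D a_g)^{-1} \log c_g$ with D = 1.7, which reproduces the two values quoted above, and the information provided by the correct answer is taken to be minus the second derivative of $\log P_g(\theta)$ , approximated here by a central difference. The function names are hypothetical.

```python
import math

D = 1.7  # scaling constant D; 1.7 reproduces the values quoted in the text


def three_pl(theta, a, b, c):
    """Operating characteristic of the correct answer in the three-parameter logistic model."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * psi


def critical_value(a, b, c):
    """Critical value as read from (7.24): b_g + (2 D a_g)^(-1) log c_g."""
    return b + math.log(c) / (2.0 * D * a)


def correct_answer_information(theta, a, b, c, h=1e-4):
    """Minus the second derivative of log P with respect to theta (central difference)."""
    def log_p(t):
        return math.log(three_pl(t, a, b, c))
    return -(log_p(theta + h) - 2.0 * log_p(theta) + log_p(theta - h)) / (h * h)


for c in (0.20, 0.25):
    t0 = critical_value(1.00, 0.00, c)
    below = correct_answer_information(t0 - 0.2, 1.00, 0.00, c)
    above = correct_answer_information(t0 + 0.2, 1.00, 0.00, c)
    print(f"c_g = {c:.2f}: critical value {t0:.6f}, sign below {below < 0}, sign above {above > 0}")
```

The sign of the numerically differentiated quantity changes from negative to positive as $\theta$ crosses the critical value, which is the behavior the text describes.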

An important implication is that $\underline{\theta}_g$ is the point of $\theta$ below which the existence of a unique
maximum likelihood estimate is not assured for all the response patterns which include the correct
answer to item $g$ . Although this warning has been ignored by most researchers for many years,
recent research (Yen, Burket and Sykes, in press) points out that this is happening much more often than
people might think.

It has been pointed out (Samejima, 1979a, 1982a) that there is a certain constancy in the total 
amount of item information, regardless of the parameter values and of specific functional formulae for 
the operating characteristic of the correct answer. If, for example, the model belongs to Type A, i.e., 
the operating characteristic of the correct answer is monotone increasing with zero and unity as its
lower and upper asymptotes, respectively, then the total area under the curve of the square root of the
item information function will equal $\pi$ . If the model belongs to Type B, i.e., the same as Type A
except that the lower asymptote of the operating characteristic of the correct answer is greater than
zero, as is the case with the three-parameter logistic model, then the total area will become

(7.25) $\pi - 2 \tan^{-1} [c_g (1 - c_g)^{-1}]^{1/2}$ ,

with the second and last term as the loss in the amount of total item information. This last term
is strictly a function of $c_g$ . When $c_g = 0.20$ , for example, the total amount of item information
reduces, approximately, to $0.705\pi$ , and when $c_g = 0.25$ it is approximately equal to $0.667\pi$ .
More observations concerning the effect of noise in the three-parameter logistic model have been made
elsewhere (Samejima, 1982b).
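
The constancy expressed by (7.25) can be verified numerically. The sketch below compares the closed form with a trapezoidal integration of the square root of the item information function of the three-parameter logistic model. The integration limits, the step size, the value D = 1.7 and all function names are choices made for this illustration only.

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def total_area_closed_form(c):
    """Total area as read from (7.25); c = 0 gives the Type A value pi."""
    return math.pi - 2.0 * math.atan(math.sqrt(c / (1.0 - c)))


def sqrt_information_3pl(theta, a, b, c):
    """Square root of the item information, P'(theta) / sqrt(P (1 - P)), in a stable form."""
    e = math.exp(-D * a * (theta - b))
    psi, q = 1.0 / (1.0 + e), e / (1.0 + e)  # Psi and 1 - Psi
    p = c + (1.0 - c) * psi
    return D * a * psi * math.sqrt((1.0 - c) * q / p)


def total_area_numeric(a, b, c, lo=-15.0, hi=15.0, steps=30000):
    """Trapezoidal approximation of the area under sqrt(item information) on [lo, hi]."""
    width = (hi - lo) / steps
    ys = [sqrt_information_3pl(lo + k * width, a, b, c) for k in range(steps + 1)]
    return width * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


for c in (0.0, 0.20, 0.25):
    closed = total_area_closed_form(c) / math.pi
    numeric = total_area_numeric(1.0, 0.0, c) / math.pi
    print(f"c_g = {c:.2f}: closed form {closed:.3f} pi, numeric {numeric:.3f} pi")
```

Changing the values passed for the discrimination and difficulty parameters leaves the result unchanged up to integration error, as long as the integration interval covers the region where the curve rises.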

As all the above observations indicate, the addition of the third parameter, $c_g$ , to the logistic model
creates many negative results. We have seen that these negative effects are greater for larger values of
$c_g$ . In using the three-parameter logistic model as an approximation to real operating characteristics,
therefore, we need to take these facts into consideration. Among others, if we are in a situation where
we can modify or revise our items, we must try to reduce the effect of noise coming from $c_g$ as much
as possible. Strategies of writing the multiple-choice test items must be considered accordingly.

[VII.3] Informative Distractors of the Multiple-Choice Test Item 

So far most observations and discussion have been focused on theory. Applications of certain nonparametric
methods of estimating the operating characteristics for some empirical data have revealed,
however, that many multiple-choice test items do not follow the three-parameter model, nor do they
follow the Equivalent Distractor Model in general, to which the three-parameter logistic model belongs. 
Those items can best be interpreted by the Informative Distractor Model. 

Figure 7-4 presents an example of the set of operating characteristics of the four alternative answers 
to an item taken from the Level 11 Vocabulary Subtest of the Iowa Tests of Basic Skills (Samejima,
1984a), which was estimated by the Simple Sum Procedure of the Conditional P.D.F. Approach combined
with the Normal Approach Method (cf. Section 6.1). We can see in this figure that each distractor
has its own unique operating characteristic, or plausibility function, and also that the estimated oper- 
ating characteristic of the correct answer is fairly close to the one in the normal ogive model, which 
is drawn by a solid line in the figure. This set of operating characteristics can better be represented 
by one of the family of models proposed for the multiple-choice test item, which was originated by the 
philosophy described in the preceding section and takes account of the unique information provided by 
each distractor as well as the effect of the examinees' random guessing behavior (cf. Samejima, 1979b). 
Figure 7-5 illustrates the operating characteristic of the correct answer in Model A. We can see that 
it is very close to the one in the normal ogive model which is drawn by a dotted line, except for the 
lower part of the curve, the conditional probability of success which is almost entirely caused by random 
guessing. In cases like this, it will be wise to approximate the curve by the normal ogive function by 
discarding the item response in estimating lower ability, since it provides us with nothing but noise, as 
was discussed in the preceding section. 

Detailed observations for the plausibility functions of distractors are made elsewhere (Samejima,
1984a) for the forty-three items of the Level 11 Vocabulary Subtest of the Iowa Tests of Basic Skills.
Similar discoveries have also been reported with respect to many ASVAB test items. In those results,
it is clear that separate wrong answers given as alternatives provide us with differential information,
which can be useful in ability estimation in the sense that it will substantially increase the accuracy of
estimation.

FIGURE 7-4

Example of the Estimated Operating Characteristics of the Correct Answer (Dotted Line)
and of the Three Distractors (Dashed Lines) Obtained by the Simple Sum Procedure of
the Conditional P.D.F. Approach Combined with the Normal Approach Method,
Together with the One for the Correct Answer Obtained by Assuming the
Normal Ogive Model (Solid Line), Taken from the Level 11 Vocabulary
Subtest of the Iowa Tests of Basic Skills.

FIGURE 7-5



Example of the Operating Characteristic of the Correct Answer in Model A (Solid Line) 
Together with One in the Normal Ogive Model (Dotted Line). 

[VII.4] Merits of the Nonparametric Approach for the Identification of
Informative Distractors and for the Estimation of the Operating
Characteristics of an Item

Methods and approaches developed for estimating the operating characteristics of discrete item
responses without assuming any mathematical forms (cf. Section 2.3; Samejima, 1981, 1990) enable us to
find out whether the alternative answers to an item are informative, i.e., whether they contribute
to the increment of the accuracy in the estimation of the individual's ability. In the present
research we proposed a new approach, which is called Differential Weight Procedure
of the Conditional P.D.F. Approach, and which has been described in the preceding chapter. Although
we need more research for improving the fitness further, those results obtained so far give us promise
for success in identifying informative distractors and in estimating their operating characteristics.

It is a long way, departing from the classical proportion correct and item-test regression,
[...] information functions [...] the plausibility functions of the distractors
[...] the configuration of the operating characteristics of the correct
answer [...] we shall be able to understand the characteristics of the item, its strengths
[...], and to revise the item if necessary. Successful nonparametric
methods of estimating the operating characteristics are essential, therefore, for this new, more
informative approach to the item analysis.

[VII.5] Efficiency in Ability Estimation and Strategies of Writing Test 
Items 

[...] sections [...] give us much useful information [...].
Theoretical observations indicate that non-monotonicity of the operating
characteristic of the correct answer to the multiple-choice test item is a natural consequence of theory.
It is obvious from several different angles that the third parameter, $c_g$ , in the three-parameter
model provides us with nothing but noise; the greater the value of $c_g$ , the more noise
and inaccuracy in estimation [...]. It was pointed out that, although it is
a common procedure for researchers to mold the operating characteristics of the correct answers of
multiple-choice test items into the three-parameter logistic model, some nonparametric methods
applied to empirical data have revealed the non-monotonicity of the operating characteristic of the
correct answer with many actual test items, as well as differential information provided by separate
distractors. It was also pointed out that the nonparametric approach to the estimation of the
operating characteristics of discrete item responses has been successful enough to detect the non-monotonicity
of the function when it exists, and to approximate their rather irregular curves fairly accurately.

[...] these [...] the conventional strategies of item writing, and to propose
new strategies. The first thing we need to reconsider is the lack of sufficient interactions between theorists and
[...] shall be able to improve the test, and the improvement will lead to efficiency in ability estimation.

The second thing we need to reconsider is the simpleminded avoidance of non-monotonicity of the
operating characteristic of the correct answer. While it is not desirable for an item to have higher
conditional probabilities of the correct answer on lower levels of ability than on higher levels, selecting
alternative answers so that the dips of the operating characteristic of the correct answer be smoothed
out will lead to a substantially large value of the lower asymptote of the operating characteristic in most
cases. We must recall that even a small number like 0.2 as $c_g$ in the three-parameter logistic model
is a big nuisance, as was discussed in Section 7.2. Our strategy must be that we make the best use of
those dips, instead of avoiding them.

FIGURE 7-6

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.5$ , $b_1 = -2.0$ , $b_2 = -1.0$ ,
$b_3 = 0.0$ , $b_4 = 1.0$ and $b_5 = 2.0$ .

FIGURE 7-7

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
in the Free-Response Situation Following the Logistic Model on the Graded Response
Level, with the Parameter Values: $a_g = 1.5$ , $b_1 = -2.0$ , $b_2 = -1.0$ ,
$b_3 = 0.0$ , $b_4 = 1.0$ and $b_5 = 2.0$ .

FIGURE 7-8

Operating Characteristics of the Correct Answer Obtained by the Five Different
Redichotomizations of the Graded Test Item Following the Logistic Model, with
the Discrimination Parameter, $a_g = 1.5$ , and the Difficulty Parameters,
$b_1 = -2.0$ , $b_2 = -1.0$ , $b_3 = 0.0$ , $b_4 = 1.0$ and $b_5 = 2.0$ ,
Respectively.

Figure 7-6 presents the operating characteristics of the five alternative answers of a hypothesized
test item following Model B (Samejima, 1979b), with the parameter values: $a_g = 1.50$ , $b_1 = -2.00$ ,
$b_2 = -1.00$ , $b_3 = 0.00$ , $b_4 = 1.00$ and $b_5 = 2.00$ . The subscript for each of the five difficulty
parameters indicates the order of easiness for the examinee to be attracted to the plausibility of each
alternative answer, so that, in this example, $b_5$ indicates the difficulty parameter of the correct
answer. We can see in this figure that a practical monotonicity exists for the operating characteristic
of the correct answer for the range of $\theta$ , $(-0.5, \infty)$ , and, more importantly, within this range of $\theta$
its lower asymptote is very close to zero, i.e., the nuisance caused by the non-zero lower asymptote will
be gone as far as we administer the item to populations of subjects whose ability distributes on higher
levels than $\theta = -0.5$ .

These operating characteristics of the five alternative answers in Figure 7-6 are originated from those
in the logistic model on the graded response level (Samejima, 1969, 1972) with the same parameter values
(cf. Samejima, 1979b). Figure 7-7 presents the corresponding set of operating characteristics of the
alternative answers in the logistic model. We notice there is an additional strictly decreasing curve in this
figure. This curve represents the conditional probability, given $\theta$ , that the examinee does not find
attractiveness in any alternative answers. In Model B, these people are assumed to guess randomly, so
in Figure 7-6 this curve does not exist, and the conditional probability is evenly distributed among the
five alternative answers to account for the rises in their operating characteristics at lower levels of $\theta$ .

Figure 7-8 presents the operating characteristics of the correct answer following the logistic model
on the dichotomous response level, which are obtained by the five different redichotomizations of the
graded test item exemplified in Figure 7-7. In these functions, $a_g = 1.5$ is the common discrimination
parameter, and the difficulty parameters are: $b_g = -2.0, -1.0, 0.0, 1.0, 2.0$ , respectively. This is the
starting point of the graded response model, which leads to the operating characteristics illustrated in
Figure 7-7 (cf. Samejima, 1969, 1972).
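
The relationship among Figures 7-6, 7-7 and 7-8 can be sketched in code. The sketch rests on one reading of the description above and is not Samejima's (1979b) own formulation: the five redichotomized logistic curves are differenced to give the graded response curves, and under Model B the probability of finding no alternative attractive is shared equally among the five alternatives. D = 1.7 and all function names are assumptions of the illustration.

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def cumulative(theta, a, b):
    """Redichotomized logistic curve, as in Figure 7-8."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def graded_curves(theta, a, bs):
    """Graded response curves, as in Figure 7-7.

    Returns [P_0, P_1, ..., P_m]; P_0 is the probability of finding no alternative
    attractive and P_m is the curve of the correct answer. bs must be ascending.
    """
    star = [1.0] + [cumulative(theta, a, b) for b in bs] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(bs) + 1)]


def model_b_curves(theta, a, bs):
    """Model B reading, as in Figure 7-6: P_0 is spread evenly over the m alternatives."""
    p = graded_curves(theta, a, bs)
    share = p[0] / len(bs)
    return [pk + share for pk in p[1:]]


BS = [-2.0, -1.0, 0.0, 1.0, 2.0]
for theta in (-3.0, -0.5, 2.0):
    curves = model_b_curves(theta, 1.5, BS)
    print(theta, [round(pk, 4) for pk in curves])
```

The last entry of each printed list is the operating characteristic of the correct answer. With these parameter values it is already very small at $\theta = -0.5$ , in line with the observation about the pseudo lower asymptote.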

Suppose that two alternative answers which attract examinees of low levels of $\theta$ are replaced,
and the revised item has $b_1 = -3.0$ and $b_2 = -1.5$ , respectively. In this situation, the operating
characteristics of the correct answer obtained by the first two redichotomizations are changed. Figure
7-9 presents the set of operating characteristics for this revised test item following Model B. In this
figure we can see that the operating characteristic of the correct answer is practically strictly increasing
within the range of $\theta$ , $(-1.7, \infty)$ , and the pseudo lower asymptote of the operating characteristic
within this range of $\theta$ is still very close to zero.

A big gain resulting from this revision is the fact that the lower endpoint of the interval of $\theta$ in which
the operating characteristic of the correct answer is practically monotonic has substantially shifted to
the negative direction, while still keeping its lower asymptote practically zero. Thus we can avoid the
noise coming from the lower asymptote even if we administer the item to populations of examinees
whose ability distributions are located on lower levels of $\theta$ . In other words, without sacrificing the
accuracy of ability estimation, the utility of the item has been substantially enhanced by this revision.

The above example suggests the following strategy. 

(1) If the nonparametrically estimated operating characteristic of the correct answer to 
an item provides us with a relatively high value of $\theta$ below which monotonicity does
not exist, then change the set of distractors to include one or more wrong answers 
that attract examinees of very low levels of ability. 

It may sound difficult to do in practice. If we pay attention to actually used multiple-choice test 
items, however, we will come across many wrong alternative answers that are attracting examinees of 
very low levels of ability. To give an example, the author has come across an arithmetic item asking for 
the area of a rectangle. A substantial number of seventh graders chose the wrong alternative answer 
which equals the sum of the two sides of the rectangle of different lengths! It is obvious that those who 
did not understand how to obtain the area of a rectangle at all chose this alternative answer. 

Another consideration which is important in writing test items is to keep the pseudo lower asymptote 
of the operating characteristic of the correct answer close enough to zero, as is the case with the above 
example. This has a great deal to do with the discrimination powers of the alternative answers, as well as 
the configuration of the plausibility functions. Figure 7-10 presents the set of operating characteristics 
corresponding to Figure 7-6, by changing the discrimination parameter from $a_g = 1.5$ to $a_g = 1.0$ ,
while keeping the five difficulty parameters unchanged. If we compare Figure 7-10 with Figure 7-6, we
can see a substantial enhancement of the pseudo lower asymptote within the interval of $\theta$ , $(-0.5, \infty)$ ,
i.e., the nuisance has been increased by the change in the discrimination parameter. 
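
The effect of the discrimination parameter on the pseudo lower asymptote can be illustrated at the lower endpoint $\theta = -0.5$ . As before, this is a sketch under an assumed reading of Model B (the correct answer receives its own graded response curve plus one fifth of the probability that no alternative is found attractive), with D = 1.7 and a hypothetical function name.

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def model_b_correct(theta, a, bs):
    """Correct-answer curve under the assumed Model B reading; bs must be ascending."""
    attracted = 1.0 / (1.0 + math.exp(-D * a * (theta - bs[-1])))
    no_attraction = 1.0 - 1.0 / (1.0 + math.exp(-D * a * (theta - bs[0])))
    return attracted + no_attraction / len(bs)


BS = [-2.0, -1.0, 0.0, 1.0, 2.0]
steep = model_b_correct(-0.5, 1.5, BS)  # parameter values of Figure 7-6
flat = model_b_correct(-0.5, 1.0, BS)   # parameter values of Figure 7-10
print(round(steep, 4), round(flat, 4))
```

With the lower discrimination parameter the value at $\theta = -0.5$ is several times larger than with $a_g = 1.5$ , which is the enhancement of the pseudo lower asymptote referred to above.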

This suggests the second strategy: 

(2) If possible, try to include distractors whose estimated operating characteristics are 
steep, while keeping the differential configuration of these functions as suggested in (1). 

So far our strategies have been focused upon producing an informative operating characteristic of
the correct answer. We notice, however, that these strategies will also yield distractors which
provide us with differential information. This implies that approximation of the nonparametrically estimated
operating characteristics of one or more alternative answers by some mathematical formulae will
enable us to use this additional differential information in ability estimation. This posterior parameterization
of the nonparametrically estimated operating characteristics of distractors will certainly lead
us to increased accuracy and efficiency in ability measurement.

FIGURE 7-9

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.5$ , $b_1 = -3.0$ , $b_2 = -1.5$ ,
$b_3 = 0.0$ , $b_4 = 1.0$ and $b_5 = 2.0$ .

FIGURE 7-10

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.0$ , $b_1 = -2.0$ , $b_2 = -1.0$ ,
$b_3 = 0.0$ , $b_4 = 1.0$ and $b_5 = 2.0$ .

[VII.6] Discussion and Conclusions

In this chapter, the shortcomings of the conventional way of handling the multiple-choice test have
been summarized, and also theories and methodologies that can be applied for a better handling of the
multiple-choice test item have been described; some empirical facts have been introduced to support the
theoretical observations; finally, new strategies of item writing have been proposed which will reduce
noise and lead to more efficient ability estimation.

In spite of many controversies against the multiple-choice test, because of its economy in scoring
it has been, and still is, very popular among people in psychological and educational measurement.
Fortunately, theorists in mathematical psychology have developed many new ideas and methodologies in
the past couple of decades that can improve the way of handling the multiple-choice test. The nonparametric
approach to estimating the operating characteristic is one of them. Also, the rapid progress in electronic
technologies has made it possible to materialize these results of theories and methodologies in practical
situations. Today, we are in a position to take advantage of all these accomplishments.

VIII Efficient Computerized Adaptive Testing

In the previous chapters, various research findings obtained in the present research period have
been introduced and discussed. All of these results are beneficial for computerized adaptive testing,
especially in increasing its efficiency. This chapter will summarize observations as to how these findings
and developments can be applied in computerized adaptive testing.

[VIII.1] Validity Measures Tailoring a Sequential Subset of Items for an
Individual 

The item information function, $I_g(\theta)$ , has been used in computerized adaptive testing in
selecting an optimal item to tailor a sequential subtest of items for an individual examinee out of the
prearranged itempool. A procedure may be to let the computer choose an item having the highest value
of $I_g(\theta)$ at the current estimated value of $\theta$ for the individual examinee, which is based upon his
responses to the items that have already been presented to him in sequence, out of the set of remaining
items in the itempool.

We notice from (5.6) or (5.8) in Section 5.2 that this procedure is also supported from the standpoint
of maximizing the criterion-oriented validity, for the item which provides us with the greatest item
information $I_g(\theta)$ among all the available items in the itempool also gives the greatest values of $I^*(\zeta)$
and its square root, at any fixed value of $\theta$ .
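
The selection procedure described in this section can be sketched as follows. The sketch assumes, for illustration only, that the items follow the logistic model, so that $I_g(\theta) = D^2 a_g^2 \Psi_g(\theta)\{1 - \Psi_g(\theta)\}$ with D = 1.7; the itempool, the item labels and the function names are hypothetical.

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def item_information(theta, a, b):
    """Item information function of a binary item following the logistic model."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * psi * (1.0 - psi)


def select_next_item(theta_hat, pool, administered):
    """Label of the remaining item with the highest information at the current estimate.

    pool maps a label to (a_g, b_g); administered is the set of labels already presented.
    """
    remaining = {label: ab for label, ab in pool.items() if label not in administered}
    if not remaining:
        raise ValueError("the itempool is exhausted")
    return max(remaining, key=lambda label: item_information(theta_hat, *remaining[label]))


# Hypothetical itempool: label -> (discrimination, difficulty)
POOL = {"item1": (1.0, -1.0), "item2": (1.5, 0.0), "item3": (1.2, 0.8), "item4": (0.8, 0.2)}

first = select_next_item(0.1, POOL, administered=set())
second = select_next_item(0.1, POOL, administered={first})
print(first, second)
```

In an actual adaptive test the estimate of $\theta$ would be updated after each response before the next item is selected; it is held at 0.1 here only to keep the example short.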

[VIII.2] Use of the Modifications of the Test Information Function in Stopping
Rules

It is a big advantage of the modern mental test theory over classical mental test theory that the
standard error of estimation can locally be defined by means of $[I(\theta)]^{-1/2}$ , which does not depend upon
the population of examinees, but is solely a property of the test itself. Using this characteristic, it has
been observed (Samejima, 1977) that in computerized adaptive testing the amount of test information
can be used effectively in the stopping rule indicating, locally, the desirable accuracy of estimation of the
examinee's ability, provided that our itempool contains a large number of items whose difficulty levels
distribute widely over the range of $\theta$ of interest. A procedure may be to terminate the presentation
of a new item out of the itempool to the individual examinee when $I(\theta)$ has reached an a priori set
amount at the current value of his estimated $\theta$ .
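
A stopping rule of this kind can be sketched as follows. The sketch is deliberately simplified: the items are assumed to follow the logistic model with D = 1.7, the estimate of $\theta$ is held fixed instead of being re-estimated after each response, and the itempool, the criterion amount of 3.0 and the function names are hypothetical.

```python
import math

D = 1.7  # scaling constant D; the value 1.7 is an assumption of this sketch


def item_information(theta, a, b):
    """Item information function of a binary item following the logistic model."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * psi * (1.0 - psi)


def administer_until(theta_hat, pool, target_information):
    """Present the most informative remaining item until I(theta_hat) reaches the target.

    Simplification: theta_hat is held fixed, whereas an actual adaptive test would
    re-estimate it after every response. Stops early if the itempool is exhausted.
    """
    remaining = list(pool)
    administered, total = [], 0.0
    while remaining and total < target_information:
        best = max(remaining, key=lambda item: item_information(theta_hat, *item))
        remaining.remove(best)
        administered.append(best)
        total += item_information(theta_hat, *best)
    return administered, total


# Hypothetical itempool: a_g = 1.0, difficulties from -3.0 to 3.0 in steps of 0.25
POOL = [(1.0, -3.0 + 0.25 * k) for k in range(25)]

items, info = administer_until(0.0, POOL, target_information=3.0)
standard_error = info ** -0.5  # local standard error of estimation, [I(theta)]^(-1/2)
print(len(items), round(info, 3), round(standard_error, 3))
```

Running the same function with an estimate near the end of the difficulty range, for example 3.0, requires more items to reach the same criterion, because fewer items in the pool are informative there.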

In general, for the stopping rule in computerized adaptive testing the modified
test information functions, $\Upsilon(\theta)$ and $\Xi(\theta)$ , will serve better than the original $I(\theta)$ , for in many
practical situations our itempool is more or less limited. In particular, it is usual that there are not
so many optimal items for examinees whose ability levels are close to the upper or the lower end of
the configuration of the difficulty parameters of the items in the itempool. In such a case, even if the
amount of test information has reached a certain criterion level, it does not mean that their ability
levels are estimated with the same accuracy as those of individuals of intermediate ability levels, as was
pointed out in Chapter 3. Since, taking the MLE bias function into consideration, the two modified
test information functions, $\Upsilon(\theta)$ and $\Xi(\theta)$ , are based upon a more meaningful minimum bound of the
conditional variance and upon a minimum bound of the mean squared error of the maximum likelihood
estimator, respectively, they will be effectively used as the replacement of $I(\theta)$ in stopping rules of
computerized adaptive testing.

The test information function $I(\theta)$ and its two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$ , are likely
to be the ones exemplified in the lower graph of Figure 3-5 for an individual examinee in the process
of adaptive testing, provided that the program for the test is written well. We should expect visible
differences between the results obtained by using $I(\theta)$ and by using one of its modification formulae,
therefore, especially for subjects whose ability levels are close to the upper or lower end of the ability
interval of interest. It is expected that these individuals will be required to take more test items in
order to make the accuracy of the estimation of $\theta$ comparable to that of examinees of intermediate
ability levels: a fact that could not have been disclosed without $\Upsilon(\theta)$ and $\Xi(\theta)$ .

We need to investigate this topic in the future, specifying the amount of improvement with simulated 
and empirical data collected in computerized adaptive testing.

[VIII.3] Use of Test Validity Measures in Stopping Rules

When we have a specific criterion variable $\gamma$ in mind, it is justified to use an a priori set value of
$I^*(\zeta)$ instead of $I(\theta)$ in the stopping rule of computerized adaptive testing. In so doing, we can obtain
the value of $I(\theta)$ corresponding to the a priori set value of $I^*(\zeta)$ for each $\theta$ , through the formula

(8.1) $I(\theta) = I^*(\zeta) \, (\partial \zeta / \partial \theta)^2$ ,

which is obtained from (5.9) in Section 5.2. Thus it is easy to have the computer handle this situation,
provided that we know the functional formula for $\zeta(\theta)$ .

We notice that the test validity measures proposed in the present research (cf. Chapter 5) can be
modified, if we replace the test information function $I(\theta)$ by one of its modification formulae, $\Upsilon(\theta)$ and
$\Xi(\theta)$ , which have also been proposed in the present research (cf. Chapter 3). This will be pursued in
the future, when the characteristics of these two modified test information functions have further been
pursued and clarified. It is quite possible that the new test validity measures can effectively be used in
stopping rules of computerized adaptive testing.

[VIII.4] Prediction of the Reliability Coefficient for a Specific Population 
of Examinees in Computerized Adaptive Testing 

It has also been observed (Samejima, 1977) that in computerized adaptive testing we can predict
the reliability coefficient if a specified amount of test information is used for the stopping rule for a
given level of ability in each of the test and retest situations, provided that the two conditions 1) and
2) described in Section 4.2 are met. In such a case, we can write

(8.2) $\mathrm{Corr.}(\hat\theta_1, \hat\theta_2) = [\mathrm{Var.}(\hat\theta_1) - E[\{I_{(1)}(\theta)\}^{-1}]] \, [\mathrm{Var.}(\hat\theta_1) \{\mathrm{Var.}(\hat\theta_1) - E[\{I_{(1)}(\theta)\}^{-1}] + E[\{I_{(2)}(\theta)\}^{-1}]\}]^{-1/2}$ ,

where $I_{(1)}(\theta)$ and $I_{(2)}(\theta)$ are the preset criterion test information functions in the test and retest
situations, respectively, which are adopted as the stopping rules for the two separate situations. Note
that these two criterion test information functions need not be the same, and also that the reliability
coefficient is obtainable from a single administration. In a simplified case where, in each situation, the
same amount of test information is used as the criterion for terminating the presentation of new items
for every examinee, we can rewrite the above formula into the form



(8.3) $\mathrm{Corr.}(\hat\theta_1, \hat\theta_2) = [\mathrm{Var.}(\hat\theta_1) - \sigma_1^2] \, [\mathrm{Var.}(\hat\theta_1) \{\mathrm{Var.}(\hat\theta_1) - \sigma_1^2 + \sigma_2^2\}]^{-1/2}$ ,

where $\sigma_1^2$ and $\sigma_2^2$ are the reciprocals of the constant amounts of criterion test information in the
two separate situations, respectively. If we use the same constant amount of test information as the
stopping rule in both the test and retest situations, then the reliability coefficient takes the simplest
form

(8.4) $\mathrm{Corr.}(\hat\theta_1, \hat\theta_2) = [\mathrm{Var.}(\hat\theta_1) - \sigma^2] \, [\mathrm{Var.}(\hat\theta_1)]^{-1}$ ,

where $\sigma^2$ denotes the reciprocal of this common constant amount of test information.
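
The prediction of the reliability coefficient can be written as a short function. The sketch assumes the reading of (8.3) in which the numerator is the variance of the estimate in the test situation minus the first error variance, and the variance of the estimate in the retest situation is that numerator plus the second error variance; the function name and the numerical values are hypothetical.

```python
import math


def predicted_reliability(var_estimate_1, error_var_1, error_var_2):
    """Predicted test-retest correlation of the ability estimates, as read from (8.3).

    var_estimate_1 : variance of the ability estimates in the test situation
    error_var_1    : reciprocal of the criterion test information in the test situation
    error_var_2    : reciprocal of the criterion test information in the retest situation
    """
    ability_var = var_estimate_1 - error_var_1   # variance attributed to ability itself
    var_estimate_2 = ability_var + error_var_2   # implied variance in the retest situation
    return ability_var / math.sqrt(var_estimate_1 * var_estimate_2)


same_rule = predicted_reliability(1.25, 0.25, 0.25)   # reduces to (8.4)
looser_retest = predicted_reliability(1.25, 0.25, 0.50)
print(round(same_rule, 4), round(looser_retest, 4))
```

When the two error variances are equal the function reduces to the simplest form (8.4), and a smaller criterion amount of test information in the retest situation lowers the predicted coefficient.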

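Also in computerized adaptive testing, either $\Upsilon(\theta)$ or $\Xi(\theta)$ can be used as the stopping rule in
place of the test information function $I(\theta)$ , and we can revise (8.2) into the forms

(8.5) $\mathrm{Corr.}(\hat\theta_1, \hat\theta_2) = [\mathrm{Var.}(\hat\theta_1) - E[\{\Upsilon_{(1)}(\theta)\}^{-1}]] \, [\mathrm{Var.}(\hat\theta_1) \{\mathrm{Var.}(\hat\theta_1) - E[\{\Upsilon_{(1)}(\theta)\}^{-1}] + E[\{\Upsilon_{(2)}(\theta)\}^{-1}]\}]^{-1/2}$ ,

and

(8.6) $\mathrm{Corr.}(\hat\theta_1, \hat\theta_2) = [\mathrm{Var.}(\hat\theta_1) - E[\{\Xi_{(1)}(\theta)\}^{-1}]] \, [\mathrm{Var.}(\hat\theta_1) \{\mathrm{Var.}(\hat\theta_1) - E[\{\Xi_{(1)}(\theta)\}^{-1}] + E[\{\Xi_{(2)}(\theta)\}^{-1}]\}]^{-1/2}$ ,

where the subscripts 1 and 2 represent the test and retest situations, respectively.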

[VIII.5] Differential Weight Procedure for Item Analysis and for On-Line 
Item Calibration 

It is obvious that item analysis in the true sense of the word starts from the accurate estimation of
the operating characteristics of the item. Thus the nonparametric estimation of the operating
characteristics [...] of information about an item, when it [...]. In this sense we
[...] Differential Weight Procedure of the Conditional P.D.F. Approach (cf. Chapter 6)
provides us with promise for the successful item analysis in general.

For the success in adaptive testing, it is essential to create a good initial itempool. Differential
Weight Procedure can effectively be used in selecting appropriate test items for the itempool, applied 
repeatedly in pilot studies. 

Differential Weight Procedure will especially be useful for the on-line item calibration in computerized
adaptive testing. When we use an adaptive test, it is necessary to discard certain test items from
our itempool after they have been administered too frequently, or too seldom, and replace them by new 
test items. In so doing, we need to on-line calibrate these new test items, and successful nonparametric 
estimation methods adjusted to this situation will be most valuable in order to discover the operating 
characteristics of these new test items. 

Many computer programs have been written in the present research, in order to materialize this new
method, and to put the theory and methodologies in practice. In developing this method further, it 
will be the focus of research to pursue methodologies for estimating differential weight functions under 
different circumstances. It should also be noted that we need to develop efficient computer programs
for smoothing out the irregularities of the differential weight function whenever it is needed. 

Once the operating characteristics of the test items have been discovered, however, it will be wise
to search for appropriate mathematical forms in order to simplify them mathematically by
parameterisation. In so doing, the observations and mathematical models introduced in Chapter 7 will
be useful, especially in dealing with non-monotonic operating characteristics or those which are
strictly increasing but converge to some value less than unity.

[VIII.6] Use of Informative Distractors

One of the future directions of computerised adaptive testing will be the use of information
coming from the distractors of the multiple-choice test item, as well as from the correct answer. This
will certainly increase the item information both locally and in total and, as a result, the estimation
of the individual examinee's ability will become more efficient.
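
The gain in item information can be illustrated numerically. The sketch below assumes, purely for illustration, that the alternatives of a four-choice item follow a multinomial-logit (nominal response) form; this form, the slopes and the intercepts are hypothetical and are not the plausibility functions estimated in the present research. It compares the information obtained when every alternative is scored separately with the information obtained when the item is scored only as correct or incorrect.

```python
import math

def category_probs(theta, slopes, intercepts):
    """Probabilities of choosing each alternative under a multinomial-logit form."""
    logits = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(logits)                          # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def item_information(theta, slopes, intercepts, correct=0):
    """Return (information using all alternatives, information using correct/incorrect only)."""
    probs = category_probs(theta, slopes, intercepts)
    mean_slope = sum(a * p for a, p in zip(slopes, probs))
    # Derivative of each probability with respect to theta under this form.
    derivs = [p * (a - mean_slope) for a, p in zip(slopes, probs)]
    full = sum(d * d / p for d, p in zip(derivs, probs))
    p_c, d_c = probs[correct], derivs[correct]
    dichotomous = d_c * d_c / (p_c * (1.0 - p_c))
    return full, dichotomous

# Hypothetical item: alternative 0 is correct; the distractors differ in slope.
slopes = [1.5, 0.5, -0.5, -1.5]
intercepts = [0.0, 0.5, 0.5, 0.0]
for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    full, dich = item_information(theta, slopes, intercepts)
    print(f"theta = {theta:+.1f}  all alternatives = {full:.3f}  correct/incorrect = {dich:.3f}")
```

Under this assumed form the information from all alternatives is never smaller than that from the correct/incorrect scoring, and the two coincide when all distractors share the same slope, that is, when the distractors are not informative.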

For this reason, an accurate estimation of the plausibility functions of the distractors of
multiple-choice test items becomes very important for the future of computerised adaptive testing. In
this context, again, the Differential Weight Procedure of the Conditional P.D.F. Approach will take an
important role, for it will be used for estimating the operating characteristics not only of correct
answers but of any discrete item responses, including the distractors of multiple-choice test items.

Also, the content-based observation of informative distractors, which has been described in Chapter
7, will become useful and important. The suggested strategies of writing test items (cf. Section 7.5) can
readily be adopted in the construction of itempools as well as in on-line item calibration in future
research.

[VIII.7] Discussion and Conclusions 

The above sections have summarized the research accomplishments which will directly contribute
to computerised adaptive testing. Since each accomplishment has been observed and discussed in
detail in the previous chapters, this chapter has been kept brief.

Efficient computerized adaptive testing is one of the main objectives of the present research. The
author has been pleased to introduce these accomplishments, which will benefit it from various angles.

IX Other Findings in the Present Research

There were many other research findings in the present research which have not been reported in the
ONR research reports. They concern those topics that are still being pursued, or that will find their
places in a more comprehensive framework in future research.

[...] of the maximum likelihood estimates of θ adopted in the process of the Simple Sum Procedure
of the Conditional P.D.F. Approach [...] fairly successful. We still need further research on this
subject, however, before we can evaluate this variation of the Simple Sum Procedure.

Some considerations and observations have also been made concerning possible applications of the
theories and methodologies developed so far in the area of latent trait models [...]